Testing Cluster Structure of Graphs


Artur Czumaj, Pan Peng, Christian Sohler


Abstract 

We study the problem of recognizing the cluster structure of a graph in the framework of
property testing in the bounded degree model. Given a parameter $\varepsilon$, a $d$-bounded degree graph
is defined to be $(k, \phi)$-clusterable if it can be partitioned into no more than $k$ parts, such that
the (inner) conductance of the induced subgraph on each part is at least $\phi$ and the (outer)
conductance of each part is at most $c_{d,k}\varepsilon^4\phi^2$, where $c_{d,k}$ depends only on $d, k$. Our main result
is a sublinear algorithm with running time $\widetilde{O}(\sqrt{n}\cdot\mathrm{poly}(\phi, k, 1/\varepsilon))$ that takes as input a graph
with maximum degree bounded by $d$ and parameters $k, \phi, \varepsilon$, and with probability at least $\frac{2}{3}$ accepts
the graph if it is $(k, \phi)$-clusterable and rejects the graph if it is $\varepsilon$-far from $(k, \phi^*)$-clusterable for
$\phi^* = c'_{d,k}\frac{\phi^2\varepsilon^4}{\log n}$, where $c'_{d,k}$ depends only on $d, k$. By the lower bound of $\Omega(\sqrt{n})$ on the number
of queries needed for testing graph expansion, which corresponds to $k = 1$ in our problem, our
algorithm is asymptotically optimal up to polylogarithmic factors.
1 Introduction


Cluster analysis is a fundamental task in data analysis that aims to partition a set of objects into 
maximal subsets (called clusters) of similar objects. In graph clustering, the objects to be clustered 
are the vertices of a graph and the edges of a graph describe relations between them. These relations 
may have interpretations for data analysis. For example, if the graph is the friendship graph of 
a social network, i.e., the vertices are the users of a social network and the edges correspond to 
friendship relations, edges may indicate that the users are socially related and/or have similar 
interests. In a co-author graph, where the vertices are authors and edges describe co-authorships, 
edges may be interpreted as a sign that the authors work in the same scientific community. A 
cluster is then a maximal subset of vertices that are well-connected to each other, where the precise 
meaning of being well-connected can be defined in various ways. 

In many cases, once we know the interpretation of a single edge, there is a natural interpretation 
of clusters. For example, clusters in a friendship graph correspond to social groups or clusters in a 
co-author graph correspond to scientific communities. For similar reasons, a vast number of graph
clustering methods are applied to many different kinds of social, information, and biological networks
to reveal hidden cluster structure (see, e.g., the surveys [For10], [POM09], [Sch07]).

Many efficient algorithms for finding clusters in a graph have been developed. However, with 
the increasing focus on the study of very large networks, we have to concentrate on new features 
of the clustering algorithms. For example, if one tries to find clusters in the World Wide Web 
or in a big social network, even linear time algorithms might be too slow. This is particularly
important if one wants to study the temporal development of the clusters, which requires solving
the problem on many instances (one for each point in time). In such cases, we need sublinear
time algorithms. We develop such an algorithm in this paper. Our algorithm can be used to test
whether a given graph has a cluster structure, i.e., is composed of at most $k$ clusters.

We will develop the algorithm in the framework of property testing for bounded degree graphs
[GR02]. In this framework, an algorithm has oracle access to an undirected graph $G = (V,E)$ with
a bound $d$ on the maximum degree, with $d$ typically assumed to be constant. An algorithm is
called a property tester for a given property $\Pi$ (in our case, the property of all graphs that have
a cluster structure with at most $k$ clusters) if it accepts with probability at least $\frac{2}{3}$ every graph
that has the property $\Pi$ and rejects with probability at least $\frac{2}{3}$ every graph that is $\varepsilon$-far from $\Pi$.
Here, $\varepsilon$-far means that one has to change more than $\varepsilon dn$ edges to obtain a graph of
maximum degree $d$ that has property $\Pi$. If $G$ is not $\varepsilon$-far from $\Pi$, then it is called $\varepsilon$-close. To give
a property tester on a bounded degree graph $G$, we assume that $G$ is given as an oracle, which
allows us to perform neighbor queries to $G$: for any input pair $(v,i)$, the oracle returns
the $i$th neighbor of vertex $v$ if $i \le d_G(v)$, and a special symbol if $i > d_G(v)$, where $d_G(v)$ is the
degree of $v$. This framework of graph property testing was initiated by Goldreich and Ron [GR02].
In this model, it is known that several properties are testable in constant time, such as hyperfinite
properties [NS13] (see also [CGR+14, GR02] and the references therein). We now also know that
properties such as bipartiteness [GR98] and expansion [CS10, GR00, KS11, NS10] are testable in
time $\widetilde{O}(\sqrt{n})$, with a nearly matching lower bound, and that at least $\Omega(n)$ queries are needed to
test 3-colorability [GR02]. For more results, see the recent surveys [Gol11], [Ron10].

There are several ways to assess the cluster structure of a graph, such as $k$-means, cliques,
modularity, etc. One typically would want to argue that vertices in the same cluster should be
well-connected and vertices from different clusters should be poorly connected. In this paper, we
use the concept of conductance to measure the quality of the cluster structure of a graph. Given
a graph $G = (V, E)$ with maximum degree bounded by $d$, and a subset $S \subseteq V$, the conductance
of $S$ is defined as $\phi_G(S) := \frac{e(S, V \setminus S)}{d|S|}$, where $e(S, V \setminus S)$ denotes the number of edges coming out
of $S$. Note that $\phi_G(V) = 0$. The conductance of the graph $G$, denoted as $\phi(G)$, is defined as the
minimum conductance value over all possible subsets $S$ of $V$ with $|S| \le |V|/2$. (For convenience,
we define $\phi(G) = 1$ if $G$ is the singleton graph, that is, the graph consisting of a single isolated
vertex with no edges.) For any $S \subseteq V$, let $G[S]$ be the induced subgraph of $G$ on the vertex set $S$.
Define the inner conductance of $S$ to be the conductance of the subgraph $G[S]$, namely, $\phi(G[S])$. To
avoid confusion, we will also call the conductance $\phi_G(S)$ of $S$ in $G$ the outer conductance of $S$.
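To make the two notions concrete, the following Python sketch computes the outer conductance of a set and, by brute force, the conductance of a small graph. It is an illustration only: the function names are hypothetical, the graph is assumed to be simple and given as adjacency lists, the normalization $e(S, V \setminus S)/(d|S|)$ and the value $1$ for the singleton graph are taken as assumptions, and the exhaustive search is exponential in the number of vertices.

```python
from itertools import combinations

def outer_conductance(adj, S, d):
    """phi_G(S) = e(S, V \\ S) / (d * |S|) for a nonempty vertex set S (assumed normalization)."""
    S = set(S)
    crossing = sum(1 for v in S for u in adj[v] if u not in S)
    return crossing / (d * len(S))

def induced(adj, S):
    """Adjacency lists of the induced subgraph G[S]."""
    S = set(S)
    return {v: [u for u in adj[v] if u in S] for v in S}

def conductance(adj, d):
    """phi(G): minimum of phi_G(S) over nonempty S with |S| <= |V|/2 (brute force)."""
    vertices = list(adj)
    if len(vertices) == 1:
        return 1.0  # assumed convention for the singleton graph
    return min(
        outer_conductance(adj, S, d)
        for size in range(1, len(vertices) // 2 + 1)
        for S in combinations(vertices, size)
    )

# Two triangles joined by a single edge; the degree bound is d = 3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
d = 3
left = {0, 1, 2}
outer = outer_conductance(adj, left, d)     # one crossing edge: 1 / (3 * 3)
inner = conductance(induced(adj, left), d)  # triangle, best cut is one vertex: 2 / (3 * 1)
```

In this toy graph the left triangle has small outer conductance and noticeably larger inner conductance, which is the pattern a good cluster is expected to show.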

Kannan et al. [KVV04] introduced conductance as an appropriate measure of the quality of a
cluster, and this notion has later been used in numerous more applied works (see, e.g., [Sch07]).
Intuitively, a set $S$ with small outer conductance has few
connections to the outside of $S$, and a graph $G$ with large conductance is one whose vertices
are well-connected with each other. Following this intuition, Oveis Gharan and Trevisan [OGT14]
and Zhu et al. [ZLM13] proposed to combine both outer conductance and inner conductance of a
set $S$ to measure whether $S$ is a good cluster or not. That is, a set $S$ is considered to be a good
cluster if $\phi_G(S)$ is small and $\phi(G[S])$ is large. In [OGT14], a graph $G$ is defined to be clusterable
if $G$ can be partitioned into a number of disjoint parts so that each of them is such a good cluster.
In this paper, we will use a related definition to characterize graphs with cluster structure.


1.1 Our results 


We begin with the formal definition characterizing graphs with a cluster structure and state our 
main results. The following definition is inspired by the work of Oveis Gharan and Trevisan
[OGT14].

Definition 1.1. For a $d$-degree bounded undirected graph $G = (V, E)$ with $n$ vertices and parameters
$k, \phi, \varepsilon$, we define $G$ to be $(k, \phi)$-clusterable if there exists a partition of $V$ into $h$ sets $C_1,\dots,C_h$
such that $1 \le h \le k$, and for each $i$, $1 \le i \le h$, $\phi(G[C_i]) \ge \phi$ and $\phi_G(C_i) \le c_{d,k}\varepsilon^4\phi^2$, where for fixed
$d$, $k$, $c_{d,k}$ is a universal constant. We call each $C_i$ a $\phi$-cluster and the corresponding $h$-partition an
$(h, \phi)$-clustering.


The above definition formalizes the idea that the existence of an edge is an indicator that 
two vertices are similar, i.e., two persons are friends or two authors belong to the same scientific 
community, while the lack of an edge is a (weaker) sign of the opposite statement. Therefore, a 
cluster should be, intuitively, well-connected on the inside and poorly connected to the outside. (We
remark that the gap between the conductance of $C_i$ and $G[C_i]$ in Definition 1.1 is a feature of our
approach rather than an inherent property of the problem.)

In this paper, we develop an algorithm that, with probability at least $\frac{2}{3}$, accepts every $(k,\phi)$-clusterable
graph and rejects every graph that is $\varepsilon$-far from every $(k, \phi^*)$-clusterable graph, where
$\phi^* = O_{d,k}\bigl(\frac{\varepsilon^4\phi^2}{\log n}\bigr)$. (Throughout the paper we use the notation $O_{d,k}(\cdot)$ to describe a function in the
Big-Oh notation assuming that $d$ and $k$ are constant.) Our main result is that we can distinguish
such a clusterable graph from all graphs that are far from being clusterable in sublinear time.


Theorem 1.2. Let $c'_{d,k}$ be a suitable constant depending on $d$ and $k$. There exists an algorithm
that accepts every $(k, \phi)$-clusterable graph of maximum degree at most $d$ with probability at least $\frac{2}{3}$,
and rejects every graph of maximum degree at most $d$ that is $\varepsilon$-far from being $(k,\phi^*)$-clusterable
with probability at least $\frac{2}{3}$, if $\phi^* \le c'_{d,k}\frac{\phi^2\varepsilon^4}{\log n}$. The running time of the algorithm is
$\sqrt{n}\cdot\mathrm{poly}(\phi)\cdot(k\log n/\varepsilon)^{O(1)}$.

One can question whether the gap between $\phi^*$ and $\phi$ in the form $\phi^* = O_{d,k}\bigl(\frac{\varepsilon^4\phi^2}{\log n}\bigr)$ or similar is
really required. We believe that for an algorithm with a somewhat similar time complexity, both
the $\log n$ and the $\varepsilon$ factors in the gap between $\phi$ and $\phi^*$ are necessary. For further discussion about
this gap size we refer to Section 1.2.

Note also that in our results we allow for clusterings with at most $k$ clusters (rather than with
exactly $k$ clusters). This can be justified by the fact that in the property testing framework, every
$(k, \phi)$-clusterable graph with exactly $h < k$ clusters is $\varepsilon$-close to some $(k, \phi^*)$-clusterable graph with
exactly $k$ clusters, for any reasonable choice of parameters (one can simply remove all edges that
are incident to $k - h$ vertices).


1.2 Comparison with testing expansion and discussion of the gap size 

For $k = 1$, our problem is equivalent to that of testing graph expansion, a problem which has
received significant attention in the past. Goldreich and Ron [GR00] were the first to study this
problem in detail and proved a lower bound of $\Omega(\sqrt{n})$ on the number of queries for testing graph
expansion in the bounded degree model. This result has been complemented by a proposed algorithm,
which Goldreich and Ron conjectured to be a property tester for the second largest eigenvalue
(denoted by $\eta_2$) of the normalized adjacency matrix of a regularized version of the graph, in the
sense that it accepts every graph with $\eta_2 \le \eta$ and rejects every graph that is $\varepsilon$-far from having
$\eta_2 \le \eta^{\Theta(\mu)}$ for any $\mu > 0$. Note that by Cheeger's inequality (cf. Theorem A.3), resolving this
conjecture would imply that the algorithm is also a property tester that accepts any graph with
$\phi(G) \ge \phi$ and rejects every graph that is $\varepsilon$-far from being a $\phi^*$-expander for $\phi^* = \Theta(\mu\phi^2)$, where
a graph $G$ is called a $\phi$-expander if $\phi(G) \ge \phi$. Czumaj and Sohler [CS10] proved a weaker version
of this conjecture by showing that the algorithm from [GR00] can distinguish in time $\widetilde{O}(\sqrt{n})$ any
$\phi$-expander graph from graphs that are $\varepsilon$-far from being a $\phi^*$-expander for $\phi^* = \Theta(\phi^2/\log n)$. Kale
and Seshadhri [KS11] and Nachmias and Shapira [NS10] extended this result and proved that in
time $O(n^{0.5+\mu})$ the algorithm accepts graphs with expansion $\phi$ and rejects graphs which are $\varepsilon$-far
from having expansion $\phi^* = O(\mu\phi^2)$.

Since the best known methods require a gap between $\phi$ and $\phi^*$ already for the special case $k = 1$,
it is clear that our work will also need a similar gap. It seems tempting to conjecture that,
similarly to the case of testing expansion, it will suffice to reject (in the soundness case) graphs
that are $\varepsilon$-far from being $(k, O(\mu\phi^2))$-clusterable for any $\mu > 0$, instead of having a $\log n$ factor
dependency between $\phi$ and $\phi^*$, as in our result. However, we do not think that this is possible,
and in the following we briefly sketch the differences from testing expansion and argue why the
approach that led to a better gap for testing expansion is likely to fail (of course, this does not rule
out other approaches, but it points to substantial obstacles to obtaining an improved result).

Let $u, v$ be any two vertices in the graph $G$, which for simplicity is now assumed to be $d$-regular
and connected. Let $\lambda_i$ be the $i$-th smallest eigenvalue and $v_i$ be the corresponding eigenvector of
the (normalized) Laplacian of $G$. It is known that the distribution of the end-vertex of the lazy
random walk on $G$ converges to the uniform distribution. One can write (cf. Section 5.2 for details) the $\ell_2^2$-distance
between the distributions $p_v^\ell$ and $p_u^\ell$ of the endpoints of the lazy random walks on $G$ of length $\ell$
starting at $v$ and $u$, respectively, as

$$\|p_v^\ell - p_u^\ell\|_2^2 = \sum_{i=1}^{n} \bigl(v_i(u) - v_i(v)\bigr)^2 \Bigl(1 - \frac{\lambda_i}{2}\Bigr)^{2\ell}.$$

Since a lazy random walk on a regular graph converges to the uniform distribution, we have $v_1(u) =
v_1(v) = 1/\sqrt{n}$. Therefore, in the case $k = 1$, by the fact that $0 = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n \le 2$, we
can upper bound $\|p_v^\ell - p_u^\ell\|_2^2$ by bounding the second smallest eigenvalue and by making a proper
choice of the length of the walk $\ell$.
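The spectral expression for the distance between the two endpoint distributions can be derived in a few lines. The sketch below assumes that $G$ is $d$-regular, that $L$ is its normalized Laplacian with orthonormal eigenvectors $v_1,\dots,v_n$ and eigenvalues $0 = \lambda_1 \le \dots \le \lambda_n \le 2$, and that the lazy walk matrix is $W = I - \frac{1}{2}L$; the symbols $W$ and $\mathbf{1}_v$ (the indicator vector of $v$) are introduced here only for illustration.

```latex
\begin{align*}
W &= I - \tfrac{1}{2}L = \sum_{i=1}^{n}\Bigl(1-\tfrac{\lambda_i}{2}\Bigr) v_i v_i^{\top},
\qquad
p_v^{\ell} = W^{\ell}\mathbf{1}_v = \sum_{i=1}^{n}\Bigl(1-\tfrac{\lambda_i}{2}\Bigr)^{\ell} v_i(v)\, v_i,\\
p_v^{\ell}-p_u^{\ell} &= \sum_{i=1}^{n}\Bigl(1-\tfrac{\lambda_i}{2}\Bigr)^{\ell}\bigl(v_i(v)-v_i(u)\bigr)\, v_i,\\
\|p_v^{\ell}-p_u^{\ell}\|_2^2
&= \sum_{i=1}^{n}\bigl(v_i(u)-v_i(v)\bigr)^2\Bigl(1-\tfrac{\lambda_i}{2}\Bigr)^{2\ell}
 = \sum_{i=2}^{n}\bigl(v_i(u)-v_i(v)\bigr)^2\Bigl(1-\tfrac{\lambda_i}{2}\Bigr)^{2\ell}
 \le 2\Bigl(1-\tfrac{\lambda_2}{2}\Bigr)^{2\ell}.
\end{align*}
```

The term $i = 1$ vanishes because $v_1$ is constant, and the last inequality uses $\sum_{i}(v_i(u)-v_i(v))^2 = \|\mathbf{1}_u - \mathbf{1}_v\|_2^2 = 2$ for $u \ne v$ together with $0 \le 1 - \lambda_i/2 \le 1 - \lambda_2/2$ for $i \ge 2$. This is why a lower bound on $\lambda_2$ and a suitable walk length $\ell$ suffice when $k = 1$.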
If we want to extend this approach to $k > 1$, then our definition implies (cf. [LOT12]) that in
a $(k, \phi)$-clusterable graph there is a significant gap between $\lambda_h$ and $\lambda_{h+1}$ for some $h$, $1 \le h \le k$,
where $h$ corresponds to the number of clusters in the instance. Now, assume for simplicity that
$h = k$. Then we obtain that

$$\|p_v^\ell - p_u^\ell\|_2^2 = \sum_{i=1}^{k} \bigl(v_i(u) - v_i(v)\bigr)^2 \Bigl(1 - \frac{\lambda_i}{2}\Bigr)^{2\ell} + \sum_{i=k+1}^{n} \bigl(v_i(u) - v_i(v)\bigr)^2 \Bigl(1 - \frac{\lambda_i}{2}\Bigr)^{2\ell}.$$


We can upper bound the second summand by using the bound for $\lambda_{k+1}$, in the same
way as we can bound the entire term by bounding $\lambda_2$ in the case $k = 1$. However, the critical part
is the first summand. It turns out that there are instances where the average $\ell_2^2$-distance between
$p_u^\ell$ and $p_v^\ell$ for $u, v$ from the same cluster remains comparatively large for a certain reasonable
choice of $\ell$, such that the random
walk mixes well in the cluster while not escaping too often from some non-expanding set containing the
cluster (for more details, see the discussion below and the appendix). This seems to rule
out an approach similar to [KS11, NS10], as this approach requires a significantly smaller distance
between $p_u^\ell$ and $p_v^\ell$.


1.3 Our techniques 

We develop the first sublinear algorithm for testing if a graph is $(k, \phi)$-clusterable, significantly
extending earlier works on testing the expansion of a graph. Our algorithm draws a random sample
set and tests for every pair of sample vertices if the distributions of the endpoints of a random walk
starting at the two vertices are close in the $\ell_2^2$-distance. If this is the case, then it connects the two
sample vertices by an edge in a similarity graph. At the end, the algorithm accepts the input graph
if the similarity graph is a collection of at most $k$ connected components.

Our main new contributions are as follows. 


• Our algorithm is the first property tester that directly makes use of testing pairwise closeness
of distributions induced by random walks. Previous related algorithms [CS10, GR00, KPS13,
KS11, NS10] tested if the distribution of the endpoints of a random walk starting at a vertex
$v$ is close to the uniform distribution and then drew their conclusions about the structure of
the graph. In our case, we do not know what the distribution looks like (it will be close to
uniform inside every cluster, but this is not very helpful since the cluster is unknown to us and
the support size of a distribution is hard to estimate [RRSS09]) and it may have significant
distance from the uniform distribution.


• It is the first property tester that exploits (in the completeness case) a “somewhat stable” 
behaviour of the random walk distribution at a length where it is significantly different from 
the stationary distribution, i.e., we pick the length of the random walk in such a way that 
it is almost stable on its own cluster, and most of the probability mass will stay in some 
non-expanding set containing the cluster. 


In order to test closeness of distributions, we use a recent tester for closeness of distributions
in $\ell_2$-norm by Chan et al. [CDVV14], which gives slightly better bounds than the corresponding
tester of Batu et al. [BFR+13]. A combination with a necessary condition on the $\ell_2$-norm of the
distribution of the endpoints of the random walk from the sample vertices leads to improved bounds.
It is tempting to think of this problem in the setting of the $\ell_1$-norm since, for example, the distance
between random walks starting from different clusters is typically $\Omega(1)$ in $\ell_1$-norm. But this is
misleading. It is known that no stable $\ell_1$-tester exists, i.e., $\ell_1$-testers cannot distinguish the case
that distributions are close from the case that they are not [VV11] ($\ell_1$-testers can only distinguish
between identical (or almost identical) distributions and distributions that are far away from each
other). However, as already explained in the previous section, we cannot hope for distributions to
be arbitrarily close even if the random walks start in the same cluster. To address this difficulty,
we will use the fact (noted earlier by Batu et al. [BFR+13]) that an $\ell_1$-tester can be reduced to
an $\ell_2$-tester if the probability of every item is $O(n^{-1})$, which is likely to be the case if the graph is
$(k, \phi)$-clusterable.

We note that in the $\ell_2^2$-distance, a typical distance between the distributions of the endpoints of
the random walks starting in two vertices from different clusters can be very small. For example,
if we have two disconnected expanders (clusters) on $n/2$ vertices each, then for a sufficiently long
random walk the distribution of the endpoints of the walk will be (almost) uniform on the cluster
of the starting vertex. Therefore the distance between the distributions of the endpoints of random
walks starting in different clusters will be $O(1/n)$. Furthermore, as we have argued above, the
distance between the distributions of the endpoints of random walks will not be much smaller in
the case that they come from the same cluster. Analyzing these two cases is one of the central
technical challenges of our paper.
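The two-expander example can be checked numerically. The following sketch idealizes the endpoint distributions as exactly uniform on each half of the vertex set (an assumption; real walks are only approximately uniform) and compares the $\ell_1$ distance with the squared $\ell_2$ distance. The helper names are hypothetical.

```python
def l1_distance(p, q):
    """Sum of absolute differences between two probability vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))

def l2_sq_distance(p, q):
    """Squared Euclidean distance between two probability vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def uniform_on(block, n):
    """Uniform distribution on the index set `block`, as a length-n vector."""
    return [1.0 / len(block) if i in block else 0.0 for i in range(n)]

n = 8
p = uniform_on(set(range(0, n // 2)), n)   # walk started in the first cluster
q = uniform_on(set(range(n // 2, n)), n)   # walk started in the second cluster

dist_l1 = l1_distance(p, q)        # equals 2 for every n
dist_l2_sq = l2_sq_distance(p, q)  # equals n * (2/n)^2 = 4/n
```

So even for walks that never meet, the squared $\ell_2$ distance shrinks like $4/n$, which is why the thresholds used by the tester have to be on the order of $1/n$.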


1.4 Other related work 


In the context of property testing, Alon et al. [ADPR03] studied the problem of testing if a set of
points in $\mathbb{R}^d$ is clusterable (see also [CS05]), but both their problem definition and techniques are
quite different from ours. Kale et al. [KPS13] gave a sublinear expansion reconstruction algorithm
that outputs the neighborhood of any input vertex $u$ in an $\Omega(\frac{\phi^2}{\log n})$-expander $G'$ that is $O(\frac{\varepsilon}{\phi})$-close to
the input graph $G$, which is assumed to be $\varepsilon$-close to a $\phi$-expander. In particular, they designed
an algorithm that runs in $\widetilde{O}(\sqrt{n})$ time and distinguishes vertices from a large set that induces an
expander from vertices that belong to a bad cut, by using uniform averaging random walks and
testing if the distribution of endpoints of the walk is close to the uniform distribution (in the $\ell_1$-norm
distance) or not. This work does not (directly) compare distributions of the endpoints of the
random walks starting from different vertices, as we do in our paper.

Our work is closely related to works on testing distributions. Batu et al. [BFR+00, BFR+13]
were the first to give sublinear time algorithms for testing the closeness of two discrete distributions
and since then, a large body of work has been devoted to the problem of estimating the
properties of distributions from a small number of samples (see the recent survey [Rub12] and the
references therein). In particular, Levi et al. [LRR13] gave an algorithm with complexity $\widetilde{O}(n^{2/3})$
to test whether a set of distributions over a domain of size $n$ can be partitioned into $k$ clusters.
Very recently, Chan et al. [CDVV14] gave asymptotically optimal testers for the closeness of two
distributions under both the $\ell_1$ and $\ell_2$ settings.

Besides the related works in the literature of property testing, our work is also closely related
to the area of graph partitioning and spectral clustering. Ng et al. [NJW01] and Shi and Malik [SM00]
used the first few eigenvectors of some matrices to partition a graph (or a set of data) into sparsely
connected clusters. Different ways of measuring clustering based on intra-cluster density vs. inter-cluster
sparsity and some experimental results were given in [BGW07]. Kannan et al. [KVV04]
proposed a bicriteria measure of the quality of a clustering, in which a good clustering is defined to
be a partition of the vertex set such that each set in the partition has large inner conductance and few
edges lie between different sets. They gave a spectral approximation algorithm for finding
such a clustering. Lee et al. [LOT12] and Louis et al. [LRTV12] recently gave a theoretical analysis
of some spectral algorithms that use the first $k$ eigenvectors of the normalized Laplacian matrix
for finding a $k$-partition of a graph such that each part is of small (outer) conductance, without
any restriction on the inner conductance of the cluster. Zhu et al. [ZLM13, OZ14] gave personal
PageRank based and flow based local algorithms for finding a set of large inner conductance and
small outer conductance. Makarychev et al. [MMV12] studied a semidefinite programming based
algorithm in the semi-random model to find such a set. Tanaka [Tan13] and Oveis Gharan and
Trevisan [OGT14] recently studied the existence and construction of a $k$-clustering such that each
cluster is of large inner conductance and of small outer conductance, under the assumption that
there is some gap between $\rho_G(k)$ and $\rho_G(k + 1)$, where $\rho_G(k)$ is the minimum conductance of any
$k$ disjoint subsets of the graph (see the formal definition later in the paper). Dey et al. [DRS14] considered the performance of
a spectral clustering algorithm that applies a greedy algorithm for $k$-centers on some embedding
induced by the first $k$ eigenvectors of the graph Laplacian. Peng et al. [PSZ14] studied the eigenvector
structures of the Laplacian of well-clustered graphs (which are closely related to our definition of
clusterable graphs) and the approximation ratio of $k$-means clustering algorithms on these graphs.


1.5 Organization of the paper 

In Section 2 we give notations and definitions used throughout the paper. In Section 3 we give a
formal description of our tester for clusterable graphs. We then present in Section 4 some central
properties, which we use for proving our main result, Theorem 1.2. The proofs of these central
properties are given in Section 5. Section 6 has final conclusions. Finally, in Appendix A we
present some auxiliary tools used in the analysis.

2 Preliminaries 

Let $G = (V, E)$ be an undirected and unweighted graph with maximum degree bounded by a
constant $d$. Let $n := |V|$. For a vertex $v \in V$, let $d_G(v)$ be the degree of $v$. We assume that $G$
is represented by its adjacency list and that we can access $G$ through an oracle, which allows us
to perform the neighbor query to $G$. That is, when the oracle is given as input a vertex $v$ and an
integer $i$, it outputs the $i$-th neighbor of $v$ if $d_G(v) \ge i$, and a special symbol otherwise (in constant
time).
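The neighbor-query access model can be mimicked with a small wrapper around adjacency lists. This is only a sketch: the class name `NeighborOracle`, the use of `None` as the special symbol, and the query counter are illustrative assumptions rather than part of the model's definition.

```python
from typing import Dict, List, Optional

class NeighborOracle:
    """Adjacency-list access restricted to neighbor queries (hypothetical helper)."""

    def __init__(self, adjacency: Dict[int, List[int]], d: int):
        if any(len(nbrs) > d for nbrs in adjacency.values()):
            raise ValueError("degree bound violated")
        self._adj = adjacency
        self.d = d
        self.queries = 0  # number of queries made so far

    def neighbor(self, v: int, i: int) -> Optional[int]:
        """Return the i-th neighbor of v (1-indexed), or None if i exceeds deg(v)."""
        self.queries += 1
        nbrs = self._adj[v]
        return nbrs[i - 1] if 1 <= i <= len(nbrs) else None

oracle = NeighborOracle({0: [1, 2], 1: [0], 2: [0]}, d=2)
first = oracle.neighbor(0, 1)    # first neighbor of vertex 0
missing = oracle.neighbor(1, 2)  # vertex 1 has only one neighbor
```

Counting queries in the wrapper makes it easy to check empirically that a tester touches only a sublinear part of the graph.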

As mentioned in the introduction, we will use Definition 1.1 of $(k, \phi)$-clusterable graphs and
$\phi$-clusters, inspired by [OGT14], to characterize the cluster structure of graphs and the clusters
therein. Note that a $(1, \phi)$-clusterable graph is an expander graph with conductance $\phi$, which we
abbreviate as $\phi$-expander (this should not be confused with $\phi$-cluster).

We are interested in testing if a given graph is $(k, \phi)$-clusterable in sublinear time in the framework
of property testing. Formally speaking, we will study the following problem: given parameters
$k, \phi, \varepsilon$, and a $d$-degree bounded graph $G$, we want to test if $G$ is $(k, \phi)$-clusterable or $\varepsilon$-far from
being $(k, \phi^*)$-clusterable with as few queries as possible, for $\phi^*$ being as close to $\phi$ as possible. We
have the following definition of graphs that are $\varepsilon$-far from clusterable graphs.

Definition 2.1. A graph $G$ (of maximum degree at most $d$) is $\varepsilon$-far from $(k, \phi)$-clusterable if we
have to add or delete more than $\varepsilon dn$ edges to obtain a $(k, \phi)$-clusterable graph of maximum degree
at most $d$. If $G$ is not $\varepsilon$-far from $(k, \phi)$-clusterable then it is $\varepsilon$-close to $(k, \phi)$-clusterable.


3 The algorithm 


In this section, we describe our algorithm used in Theorem 1.2. We first introduce the following
random walk on a $d$-bounded degree graph $G$ that will be used in our algorithm. In this walk, if we
are currently at vertex $v$, then in the next step we choose each incident edge $(v, u)$ with
probability $\frac{1}{2d}$ and move to $u$. With the remaining probability, which is at least $\frac{1}{2}$, we stay at $v$.
Note that if we let $G_{reg}$ denote the weighted $d$-regular graph that is obtained from $G$ by adding
an appropriate number of half-weighted self-loops, then this random walk is exactly a lazy random
walk on $G_{reg}$. We will let $p_v^\ell$ denote the distribution of endpoints of such a random walk of length
$\ell$ starting at $v$. Our testing algorithm is given as follows.

$k$-Cluster-Test$(G, s, \ell, \sigma, k)$


1. Sample a set $S$ of $s$ vertices independently and uniformly at random from $V$.

2. For any $v \in S$, let $p_v^\ell$ be the distribution of endpoints of a random walk of length $\ell$
starting at $v$.

3. For any $v \in S$, test if $\|p_v^\ell\|_2^2 > \sigma$; if so, then abort and reject.

4. For each pair $u, v \in S$: if the $\ell_2$ distribution tester accepts that $\|p_u^\ell - p_v^\ell\|_2^2 \le \frac{1}{4n}$, then
add an edge $(u, v)$ in the "similarity graph" $H$ on vertex set $S$.

5. If $H$ is the union of at most $k$ connected components, then accept; otherwise, reject.
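Step 2 only ever needs samples from the endpoint distribution, which can be drawn by simulating the lazy walk. The sketch below assumes the rule that each incident edge is taken with probability $1/(2d)$ and that the walk otherwise stays put; the function names and the plain adjacency-list input are illustrative assumptions.

```python
import random
from collections import Counter

def lazy_walk_endpoint(adj, d, start, length, rng):
    """Endpoint of one lazy walk: each incident edge is taken with probability 1/(2d)."""
    v = start
    for _ in range(length):
        j = rng.randrange(2 * d)  # uniform in {0, ..., 2d - 1}
        if j < len(adj[v]):       # one slot per incident edge
            v = adj[v][j]
        # otherwise the walk stays at v (probability at least 1/2)
    return v

def sample_endpoints(adj, d, start, length, r, rng):
    """Draw r independent endpoint samples of walks of the given length."""
    return [lazy_walk_endpoint(adj, d, start, length, rng) for _ in range(r)]

adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}  # a 4-cycle, d = 2
rng = random.Random(0)
samples = sample_endpoints(adj, d=2, start=0, length=20, r=2000, rng=rng)
counts = Counter(samples)  # roughly equal counts: the walk has mixed on the cycle
```

Each step costs one random draw and at most one neighbor lookup, so $r$ walks of length $\ell$ cost $O(r\ell)$ operations in total.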


If the graph is $(k, \phi)$-clusterable then we will show that (for the right choice of parameters) the
distributions of the endpoints of random walks will be close if they come from the same cluster.
Furthermore, Step 3 tests a necessary condition for the efficient $\ell_2$ distribution tester that will be
used in Step 4, i.e., that $\|p_v^\ell\|_2^2$ is small, which is satisfied for almost all vertices in a $(k, \phi)$-clusterable
graph. The small $\ell_2^2$-norm property of distributions can then be exploited in the testing for closeness
of distributions in Step 4 to obtain a better running time.
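The overall control flow of the tester can be sketched independently of how the two statistical subroutines are implemented. In the sketch below, `norm_is_small` and `looks_close` are hypothetical callables standing in for the norm test of Step 3 and the closeness test of Step 4; the component count of Step 5 uses a simple union-find structure.

```python
from itertools import combinations

def count_components(vertices, edges):
    """Number of connected components, via union-find with path halving."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(v) for v in vertices})

def k_cluster_test(sample, norm_is_small, looks_close, k):
    """Accept iff every sampled vertex passes the norm test and H has at most k components."""
    sample = list(dict.fromkeys(sample))  # repeated sample vertices are merged
    if not all(norm_is_small(v) for v in sample):
        return False                      # Step 3: abort and reject
    edges = [(u, v) for u, v in combinations(sample, 2) if looks_close(u, v)]  # Step 4
    return count_components(sample, edges) <= k                               # Step 5

# Toy stand-ins: "closeness" is decided by a known cluster label.
cluster_of = {0: "a", 1: "a", 2: "b", 3: "b", 4: "c"}
same_cluster = lambda u, v: cluster_of[u] == cluster_of[v]
accept_k3 = k_cluster_test(list(cluster_of), lambda v: True, same_cluster, k=3)
accept_k2 = k_cluster_test(list(cluster_of), lambda v: True, same_cluster, k=2)
```

With $s$ sampled vertices the skeleton performs $O(s^2)$ closeness tests, so the running time is dominated by the sample size used inside each test.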


3.1 Implementation of distribution testing 


Our algorithm relies on an efficient tester for the $\ell_2$-closeness of two distributions $p$ and $q$. The
tester used in Step 4 of $k$-Cluster-Test was recently proposed by Chan et al. [CDVV14] and is
similar to the $\ell_2$ distance tester in [BFR+13] that uses the statistics of collisions in the sample sets
from both distributions $p, q$. The following is a direct corollary of Theorem 1.2 from [CDVV14].


Theorem 3.1. Let $c_0$ be some appropriate constant, $c_0 \ge 1$. Let $\delta, \xi > 0$ and let $p, q$ be two
distributions over a set of size $n$ with $b \ge \max\{\|p\|_2^2, \|q\|_2^2\}$. Let $r \ge c_0\frac{\sqrt{b}}{\xi}\ln\frac{1}{\delta}$. There exists an
algorithm, denoted by $\ell_2$-Distribution-Test, that takes as input $r$ samples from each distribution
$p, q$, and accepts the distributions if $\|p - q\|_2^2 \le \xi$, and rejects the distributions if $\|p - q\|_2^2 > 4\xi$,
with probability at least $1 - \delta$. The running time of the tester is linear in its sample size.
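To illustrate the collision idea behind such testers, the sketch below estimates $\|p - q\|_2^2 = \|p\|_2^2 + \|q\|_2^2 - 2\langle p, q\rangle$ from samples by counting collisions within and across the two sample sets. This is a simplified estimator in the spirit of collision-based testers, not the exact statistic of [CDVV14]; the function names and the decision threshold $2\xi$ are assumptions made for illustration.

```python
from collections import Counter

def self_collision_rate(xs):
    """Unbiased estimate of ||p||_2^2: fraction of colliding pairs among the samples (needs >= 2 samples)."""
    r = len(xs)
    colliding_pairs = sum(c * (c - 1) // 2 for c in Counter(xs).values())
    return colliding_pairs / (r * (r - 1) / 2)

def cross_collision_rate(xs, ys):
    """Unbiased estimate of <p, q>: fraction of pairs (x, y) with x == y."""
    cx, cy = Counter(xs), Counter(ys)
    return sum(cx[a] * cy[a] for a in cx) / (len(xs) * len(ys))

def l2_sq_distance_estimate(xs, ys):
    """Estimate of ||p - q||_2^2; may be slightly negative due to sampling noise."""
    return self_collision_rate(xs) + self_collision_rate(ys) - 2 * cross_collision_rate(xs, ys)

def accepts_close(xs, ys, xi):
    """Hypothetical decision rule: accept when the estimate is below 2 * xi."""
    return l2_sq_distance_estimate(xs, ys) <= 2 * xi

far = l2_sq_distance_estimate([0, 0, 0, 0], [1, 1, 1, 1])  # two distinct point masses
same = l2_sq_distance_estimate([0, 0], [0, 0])              # identical point masses
```

Counting collisions with `Counter` keeps the running time linear in the number of samples, matching the cost model stated for the tester.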


We also need an efficient algorithm to estimate the $\ell_2^2$-norm of the probability distribution of
the endpoints of a random walk in a graph. In Step 3 of our algorithm $k$-Cluster-Test we will use the
$\ell_2^2$-norm tester, the performance of which is guaranteed in the following lemma (the proof follows
almost directly from the proof of Lemma 4.2 in [CS10], which in turn is built on Lemma 1 in [GR00];
cf. the appendix for details).


Lemma 3.2. Let $G = (V, E)$ with $|V| = n$. Let $v \in V$, $\sigma > 0$ and $r \ge 16\sqrt{n}$. Let $t \ge 1$ and
let $p_v^t$ be the probability distribution of the endpoints of a random walk of length $t$ from $v$. There
exists an algorithm, denoted by $\ell_2^2$-norm tester, that takes as input $r$ samples from $p_v^t$, and accepts
the distribution if $\|p_v^t\|_2^2 \le \sigma/4$ and rejects the distribution if $\|p_v^t\|_2^2 > \sigma$, with probability at least
$1 - \frac{16\sqrt{n}}{r}$. The running time of the tester is linear in its sample size.
4 Analysis of $k$-Cluster-Test


We outline the proof of our main theorem, Theorem 1.2. Our techniques are based on two intuitions.
The first intuition is that if two "typical" vertices $u, v$ are from the same large cluster, then the
distributions of the endpoints of two sufficiently long random walks starting at $u, v$, respectively,
are close; and if $u, v$ are separated by a non-expanding cut, then the distributions of the endpoints of
two not so long random walks from $u, v$, respectively, are far away from each other. If this intuition
holds, then we can reduce our problem to the problem of testing the closeness of two distributions,
and then use the returned results to decide whether the distributions induced by the random walks
from different sampled vertices can be divided into $k$ groups or not. In particular, if our input
graph $G$ is $(k, \phi)$-clusterable, then we can get at most $k$ connected components in our "similarity
graph" $H$. (Actually, as will be seen from our proof, sampled vertices from the same cluster form
a clique in $H$.) On the other hand, if $G$ is far from being $(k, \phi^*)$-clusterable, then we expect that
we can get at least $k + 1$ connected components in $H$. The latter is based on our second intuition
that if $G$ is far from being $(k, \phi^*)$-clusterable, then there are at least $k + 1$ (large) well separated
sparse cuts. We present several lemmas that formalize these intuitions in Section 4.1 and then give
the proof of Theorem 1.2 in Section 4.2.


4.1 Key properties 

In this section, we state several lemmas describing the properties used in our analysis of
$k$-Cluster-Test. The proofs of the results are deferred to Section 5.

In the following we will formally state these key properties under the definition of a more general
class of clusterable graphs, even though our main focus is on the study of properties of $(k, \phi)$-clusterable
graphs. To study detailed properties of $(k, \phi)$-clusterable graphs and their dependencies
on all parameters, we will use the following, more general definition of $(k, \phi_{in}, \phi_{out})$-clusterable
graphs, which follows the framework from [OGT14].

Definition 4.1. For an undirected graph $G$, and parameters $k, \phi_{in}, \phi_{out}$, we define $G$ to be
$(k, \phi_{in}, \phi_{out})$-clusterable if there exists a partition of $V$ into $h$ subsets $C_1,\dots,C_h$ such that $1 \le h \le k$ and for
each $i$, $1 \le i \le h$, $\phi(G[C_i]) \ge \phi_{in}$ and $\phi_G(C_i) \le \phi_{out}$. We call each $C_i$ a $(\phi_{in}, \phi_{out})$-cluster and the
corresponding $h$-partition an $(h, \phi_{in}, \phi_{out})$-clustering.

We can define a graph $G$ to be $\varepsilon$-far from $(k, \phi_{in}, \phi_{out})$-clusterable similarly to Definition 2.1.
Note that a $(k, \phi)$-clusterable graph from Definition 1.1 is exactly a $(k, \phi, c_{d,k}\varepsilon^4\phi^2)$-clusterable graph
from Definition 4.1.

We first show that if the graph is $(k, \phi_{in}, \phi_{out})$-clusterable then for any large cluster $C$ with
$\phi(G[C]) \ge \phi_{in}$, there exists a large subset $\widetilde{C} \subseteq C$ such that the distributions of the endpoints of
two sufficiently long random walks starting from any two vertices $u, v \in \widetilde{C}$ are close in the
$\ell_2$-norm (that is, the $\ell_2$ distance between $p_u^t$ and $p_v^t$ is small). The proof of this result relies on
spectral properties of clusterable graphs given in Section 5.1.

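Lemma 4.2. Let $0 < \alpha, \beta < 1$. If $G = (V, E)$ is $(k, \phi_{in}, \phi_{out})$-clusterable, and $C \subseteq V$ is any subset
such that $|C| \ge \beta n$ and $\phi(G[C]) \ge \phi_{in}$, then there exists $c_{4.2} = c_{4.2}(k, \alpha, \beta, d)$ and a universal
constant $c > 0$ such that for any $t \ge \cdots$ and $\phi_{out} \le \cdots$, there exists a subset $\widetilde{C} \subseteq C$ with
$|\widetilde{C}| \ge (1 - \alpha)|C|$ such that for any $u, v \in \widetilde{C}$, the following holds:

$$\|p_u^t - p_v^t\|_2^2 \le \frac{1}{4n}.$$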









In order to use an efficient distribution tester (e.g., the one given in Theorem 3.1), we need 
to guarantee that, for a large fraction of starting vertices, a sufficiently long random walk 
induces a distribution of its endpoints with small $\ell_2$-norm. We will prove the 
following lemma using spectral analysis of clusterable graphs. 


Lemma 4.3. Let $0 < \alpha < 1$. If $G$ is $(k, \phi_{in}, \phi_{out})$-clusterable, then there exists $V' \subseteq V$ with 
$|V'| \ge (1 - \alpha)|V|$ such that for any $u \in V'$ and any $t \ge \frac{c_{4.3} k^4 \log n}{\phi_{in}^2}$, for some universal constant 
$c_{4.3} > 0$, the following holds: 

$$\|p_u^t\|_2^2 \le \frac{2k}{\alpha n}.$$ 


Note that the above lemma does not require any assumption about $\phi_{out}$, and thus applies 
directly to any $(k, \phi)$-clusterable graph by substituting $\phi$ for $\phi_{in}$ in the lemma. 

For the soundness of our algorithm, we need the following lemma, which shows that given two 
well-separated sets $A, B \subseteq V$, for any two "typical" vertices $u \in A$, $v \in B$, the $\ell_2$-norm of the difference 
between the corresponding distributions of endpoints of short random walks starting from 
$u$ and $v$ is large. Our proof relies on the fact that any set $A$ with small outer conductance has a 
large subset $\hat{A}$ such that a random walk starting from any vertex in $\hat{A}$ stays inside $A$ for a 
relatively long time. 


Lemma 4.4. Let $\alpha$ and $\psi$ be arbitrary with $0 < \alpha, \psi < 1$. Let $A \subseteq V$ be any subset of $G$ such that 
$\phi_G(A) \le \psi$. Then for any $t \ge 1$, there exists a subset $\hat{A} \subseteq A$ with $|\hat{A}| \ge (1 - \alpha)|A|$ such that for 
any $v \in \hat{A}$, the probability that the random walk of length $t$ starting from vertex $v$ never leaves $A$ 
in all $t$ steps is at least $1 - \frac{t\psi}{2\alpha}$. 

Furthermore, for any $t$, $1 \le t \le \frac{\alpha}{2\psi}$, any two disjoint subsets $A, B \subseteq V$ with $\phi_G(A), \phi_G(B) \le \psi$, 
and any two vertices $u, v$ such that $u \in \hat{A}$, $v \in \hat{B}$, the following holds: 

$$\|p_u^t - p_v^t\|_2^2 \ge \frac{1}{n}.$$ 


Remark. We note that the above lower bound is almost tight up to constants. Consider the graph 
that is composed of two disconnected parts, each of which is a $\phi_{in}$-expander of size $n/2$. 
Then for any two starting vertices $u, v$ from two different parts and for $t$ sufficiently large, both $p_u^t$ and $p_v^t$ 
will be very close to the uniform distribution on their respective parts, and therefore the $\ell_2^2$ distance between 
these two distributions will be $O(1/n)$. 
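
The remark can be checked numerically on a toy instance. The sketch below assumes the lazy random walk on the $d$-regularized graph (move to each neighbor with probability $1/(2d)$, stay put otherwise) and uses two disjoint copies of $K_4$ as stand-ins for the two expanders; the function names are hypothetical. With both walks mixed inside their parts, each distribution is uniform on $n/2$ vertices, so the squared $\ell_2$ distance is exactly $4/n$ and each squared norm is $2/n$.

```python
def lazy_walk_distribution(adj, d, start, t):
    """Exact endpoint distribution of a t-step lazy walk with degree bound d.

    From v the walk moves to each neighbor with probability 1/(2d) and
    stays at v with the remaining probability 1 - deg(v)/(2d).
    """
    n = len(adj)
    p = [0.0] * n
    p[start] = 1.0
    for _ in range(t):
        q = [0.0] * n
        for v in range(n):
            step = p[v] / (2 * d)
            for u in adj[v]:
                q[u] += step
            q[v] += p[v] - step * len(adj[v])
        p = q
    return p


def l2_squared(p, q):
    """Squared l2 distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def two_cliques(m):
    """Two disjoint copies of K_m on vertices 0..2m-1 (each (m-1)-regular)."""
    return [[b * m + j for j in range(m) if j != i]
            for b in range(2) for i in range(m)]


adj = two_cliques(4)           # n = 8 vertices, d = 3
n, d, t = len(adj), 3, 60
p_u = lazy_walk_distribution(adj, d, 0, t)   # start in the first part
p_v = lazy_walk_distribution(adj, d, 4, t)   # start in the second part
distance = l2_squared(p_u, p_v)              # close to 4 / n
```

Cliques are used only because their walks mix within a few steps; any pair of disjoint expanders would show the same order of magnitude.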


For the analysis showing that graphs far from clusterable will be rejected, we use the property 
that if a graph $G = (V, E)$ is $\varepsilon$-far from every $(k, \phi^*_{in}, \phi^*_{out})$-clusterable graph, then its vertex set $V$ can 
be partitioned into $k + 1$ subsets $V_1, \dots, V_{k+1}$, each of linear size and of small outer conductance. 


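Lemma 4.5. Let $c_{4.5} = c_{4.5}(d, k)$ be a certain constant that depends on $d$ and $k$. If $G = (V, E)$ is 
$\varepsilon$-far from $(k, \phi^*_{in}, \phi^*_{out})$-clusterable with $\phi^*_{in} \le c_{4.5} \cdot \varepsilon$, then there exists a partition of $V$ into $k + 1$ 
subsets $V_1, \dots, V_{k+1}$ such that for each $i$, $1 \le i \le k + 1$, $|V_i| \ge \frac{\varepsilon^2}{1152k}|V|$ and $\phi_G(V_i) \le c'_{4.5}\,\phi^*_{in}\,\varepsilon^{-2}$, 
for some constant $c'_{4.5} = c'_{4.5}(d, k)$ and for any $0 \le \phi^*_{out} \le 1$. 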


4.2 Proof of main result — Theorem 1.2 


We will use Lemmas 4.2 through 4.5 to prove our main result, Theorem 1.2. In the rest of this section, 
we prove completeness and soundness, and analyze the running time of the tester k-Cluster-Test. 



^ _ I92sk 


In the algorithm /c-Cluster-Test, we set s = i536fcin(8(fc+i)) ^ ^ _ max{i:^,<^|}-fc logn 
We set r = 192cOsV^^lns = 0{ ^ (infc/e) ^ I = ^! in Theorem and 


set r = 192c^s\/ skn In s and a = —^ in Lemma 

We specify now the constant Cd,k that we used in the definition of a (/>-cluster to be Cd,k = 


ln^(8(/c+l)) 


for a universal constant c. 


4.2.1 Completeness: accepting $(k, \phi)$-clusterable graphs 

We begin by showing that the algorithm k-Cluster-Test accepts $(k, \phi)$-clusterable graphs. 

Lemma 4.6. If the input graph $G$ is $(k, \phi)$-clusterable, then with probability at least $\frac{2}{3}$ the algorithm 
k-Cluster-Test accepts $G$. 


Proof. As indicated in the algorithm, we consider random walks of length $\ell$. We apply Lemmas 4.2 
and 4.3 to the $(k, \phi)$-clusterable graph $G$, and we set $\phi_{in} = \phi$, $\phi_{out} = c_{d,k}\varepsilon^4\phi^2$, $t = \ell$, 
$\alpha = \frac{1}{24s}$, and $\beta = \frac{1}{12ks}$ in the lemmas. Note that by our definition of a $\phi$-cluster, the outer 
conductance of the cluster is at most $c_{d,k}\varepsilon^4\phi^2 \le c_{4.2}\phi^2$, since $c_{d,k}\varepsilon^4 = c_{4.2}(k, \frac{1}{24s}, \frac{1}{12ks}, d)$, which implies 
that the conditions of Lemma 4.2 are satisfied for any $\phi$-cluster of size at least $\beta n$ in $G$. Since 
$\ell = \max\{c'_{4.2}, c_{4.3}\} \cdot \frac{k^4 \log n}{\phi^2}$, we know that the chosen parameters meet all the preconditions in these lemmas. 

Since $G$ is $(k, \phi)$-clusterable, there exists some $h$, $1 \le h \le k$, and a partition of the vertex 
set of $G$ into $h$ subsets $C_1, \dots, C_h$, such that for every $i$, $1 \le i \le h$, we have $\phi(G[C_i]) \ge \phi$ and 
$\phi_G(C_i) \le c_{d,k}\varepsilon^4\phi^2$. For any vertex $v$, define $C(v)$ to be the unique cluster $C_i$ to which $v$ belongs. 

We call a vertex $v$ good if the following three conditions are satisfied: 

1. $\|p_v^\ell\|_2^2 \le \frac{48sk}{n}$. 

2. $|C(v)| \ge \beta n$. 

3. $v \in \tilde{C}(v)$, where $\tilde{C}(v) \subseteq C(v)$ is defined as in Lemma 4.2 by setting $C = C(v)$. 


The success probability of the algorithm depends on the random coins of sampling and random 
walks. We show that with probability at least $\frac{5}{6}$ (over random coins of sampling), all vertices in the 
sample set $S$ are good; and if all these vertices are good, then our tester will accept with probability 
at least $\frac{5}{6}$ (over random coins of random walks). Together, this means that with probability at least 
$\frac{5}{6} \cdot \frac{5}{6} = \frac{25}{36} > \frac{2}{3}$ the tester will accept. This will conclude the proof of the lemma. 

Claim 4.7. With probability at least $\frac{5}{6}$, all vertices in the sampled set $S$ are good. 

Proof. Let $v$ be any vertex that is sampled uniformly at random from $V$. By Lemma 4.3, the 
probability that $\|p_v^\ell\|_2^2 > \frac{48sk}{n}$ is at most $\alpha = \frac{1}{24s}$. Since there are at most $k$ clusters, the probability 
that $v$ belongs to a cluster of size less than $\beta n = \frac{n}{12ks}$ is at most the probability that $v$ is one of at most 
$k \cdot \frac{n}{12ks}$ vertices in these small clusters, which is $\frac{1}{12s}$. In addition, since $|\tilde{C}(v)| \ge (1 - \alpha)|C(v)|$, the 
probability that $v \notin \tilde{C}(v)$ is at most $\alpha = \frac{1}{24s}$. Overall, the probability that $v$ is not good is at 
most $\frac{1}{24s} + \frac{1}{12s} + \frac{1}{24s} = \frac{1}{6s}$. By the above analysis and the union bound, with probability at least 
$1 - \frac{1}{6s} \cdot s = \frac{5}{6}$, all sampled vertices in $S$ are good. □ 

Claim 4.8. Conditioned on the event that all the sampled vertices $v \in S$ are good, our tester will 
accept $G$ with probability at least $\frac{5}{6}$. 

Proof. Let $v \in S$. Since $v$ is good, $\|p_v^\ell\|_2^2 \le \frac{48sk}{n} = \frac{\sigma}{4}$. Now by Lemma 3.2, the $\ell_2^2$-norm estimator 

will reject v with probability at most By the union bound, the probability that we get 

rejected at step |3| of the algorithm is at most 

For any two vertices u, v from S, if u, v belong to the same large cluster, then by Conditions 
of good vertices and by Lemma [4 .21 , ||p 1 — Now recall that we have set b = 


216sk 

n 

r > 


= 


— (5 = 

4n’ ^ 


1 


and r = 192cj3u|s\/ skn In s in Theorem Then b > max{||p:^|| 2 , Hp^Hi 




^ c| 3 u|- ^ In and we can ensure that with probability at least 1 — 5, any call to ^ 2 -Distribution- 


Test will accept the distributions p„, p* if u, v belong to the same large cluster. By the union bound, 
the probability that there exist some call such that the distribution tester does not accept u, v if 
u,v are from the same cluster is at most s‘^6 < Therefore, the probability that the algorithm 
does not reject at step ^ and all the calls to the ^ 2 -Distribution-Test return the correct answer 
is at least 1 ~ ~ = §• 

Now note that if for any $u, v \in S$ such that $u, v$ belong to the same cluster, the distribution 
tester with input $p_u^\ell, p_v^\ell$ accepts, then there will be an edge $(u, v)$ in the "similarity graph" $H$. This 
further implies that all the vertices in $S$ that are in the same cluster will form a clique. (But note 
that two sampled vertices from two different clusters might also be connected in $H$.) Since there 
are at most $k$ clusters, we will get at most $k$ connected components in $H$, and thus the tester will 
accept $G$. □ 
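
The final acceptance rule used in this argument (build the similarity graph on the sample and accept if it has at most $k$ connected components) can be sketched as follows. The predicate `is_edge` is a hypothetical stand-in for the outcome of the pairwise distribution test; it is not the tester itself, and the function names are illustrative.

```python
from itertools import combinations


def count_components(vertices, is_edge):
    """Number of connected components of the graph on `vertices`
    whose edges are the pairs (u, v) with is_edge(u, v) true (union-find)."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in combinations(vertices, 2):
        if is_edge(u, v):
            parent[find(u)] = find(v)
    return len({find(v) for v in vertices})


def accept(sample, is_edge, k):
    """Accept iff the similarity graph on the sample has at most k components."""
    return count_components(sample, is_edge) <= k


# Toy sample: vertices 0..2 lie in one cluster and 3..5 in another.
sample = [0, 1, 2, 3, 4, 5]
same_cluster = lambda u, v: u // 3 == v // 3
# An extra edge between the clusters can only merge components.
with_spurious_edge = lambda u, v: same_cluster(u, v) or {u, v} == {2, 3}
```

Extra edges between different clusters never increase the number of components, which is why they do not hurt completeness.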


We can now apply Claims 4.7 and 4.8 to conclude the proof of Lemma 4.6. 


□ 


4.2.2 Soundness: rejecting graphs $\varepsilon$-far from $(k, \phi^*)$-clusterable 

We now present a proof of the soundness of our tester. 


Lemma 4.9. Let 7 = 'yd,k > 0 be some constant depending on d, k. If the input graph G = (F, E) 

2 

is £-far from {k,(j)*)-clusterable with (j)* < then the algorithm k- Cluster-Test rejects G with 
probability at least |. 


Proof. We will use 7 = min{^ 
Lemma ^.5| implies the existence 
for each i, 1 < i < k + 1, \Vi 


Ki = 


1 

1152fe 


4^ , Q^} . Let us first observe that our choice of 7 ensures that 
3 m a partition of V into k + 1 disjoint sets Fi,..., Vk+i such that 
> Kie^|F| and (/>g(F) < for appropriate parameters 


and K 2 = 


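Let $\alpha = \frac{1}{24s}$ (here $\alpha$ corresponds to the parameter $\alpha$ used in Lemma 4.4). For every set $V_i$, 
$1 \le i \le k + 1$, let $\hat{V}_i \subseteq V_i$ be the set of vertices $v \in V_i$ such that the probability that the 
random walk of length $\ell$ starting at $v$ does not leave $V_i$ is at least $1 - \frac{\ell \kappa_2 \phi^* \varepsilon^{-2}}{2\alpha}$. We observe that 
since $\phi_G(V_i) \le \kappa_2 \phi^* \varepsilon^{-2}$, we have $|\hat{V}_i| \ge (1 - \alpha)|V_i|$ by Lemma 4.4. Hence, our assumption that 
$|V_i| \ge \kappa_1 \varepsilon^2 |V|$ implies that $|\hat{V}_i| \ge (1 - \alpha)\kappa_1 \varepsilon^2 |V|$. 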

Let us call the sample set $S$ chosen by the algorithm k-Cluster-Test representative if 
$\hat{V}_i \cap S \ne \emptyset$ for every $i$, $1 \le i \le k + 1$, and $S \subseteq \bigcup_{i=1}^{k+1} \hat{V}_i$. 

Claim 4.10. The probability that the sample set $S$ is representative is at least $\frac{5}{6}$. 


Proof. For any set $X \subseteq V$, $\Pr[X \cap S = \emptyset] = (1 - |X|/|V|)^s \le e^{-s|X|/|V|}$. Therefore, since 
$|\hat{V}_i| \ge (1 - \alpha)\kappa_1 \varepsilon^2 |V|$, the probability that $S$ does not contain any element from $\hat{V}_i$ is at most 
$e^{-(1-\alpha)\kappa_1 \varepsilon^2 s}$. Hence, the union bound implies that the probability that there exists 
some $i \le k + 1$ such that $S$ does not contain any element from $\hat{V}_i$ is at most $(k + 1) \cdot e^{-(1-\alpha)\kappa_1 \varepsilon^2 s}$. 

In addition, the probability that there exists some vertex in $S$ that belongs to $V \setminus \bigcup_{i=1}^{k+1} \hat{V}_i$ is 
at most $s \cdot \alpha$. Therefore, the probability that $S$ is representative is at least 
$1 - (k + 1) \cdot e^{-(1-\alpha)\kappa_1 \varepsilon^2 s} - s\alpha$. Since $s = \frac{1536 k \ln(8(k+1))}{\varepsilon^2}$, we have $(1 - \alpha)\kappa_1 \varepsilon^2 s \ge \ln(8(k + 1))$, 
and hence we can conclude that this probability is at least $1 - \frac{1}{8} - \frac{1}{24} = \frac{5}{6}$. 

□ 
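
The arithmetic behind Claim 4.10 can be sanity-checked numerically. The sketch below evaluates the lower bound $1 - (k+1)e^{-(1-\alpha)\kappa_1\varepsilon^2 s} - s\alpha$ under the parameter values $s = 1536 k \ln(8(k+1))/\varepsilon^2$ and $\kappa_1 = 1/(1152k)$; the value $\alpha = 1/(24s)$ is an assumption of this sketch (our reading of the parameter choice), $s$ is treated as a real number rather than rounded to an integer, and the function name is hypothetical.

```python
import math


def representative_lower_bound(k, eps):
    """Lower bound on Pr[S is representative] for the assumed parameters."""
    s = 1536 * k * math.log(8 * (k + 1)) / eps ** 2   # sample size
    alpha = 1 / (24 * s)                              # assumed value of alpha
    kappa1 = 1 / (1152 * k)
    # Some part V_i-hat is missed by all s samples (union bound over k + 1 parts).
    miss = (k + 1) * math.exp(-(1 - alpha) * kappa1 * eps ** 2 * s)
    # Some sample falls outside the union of the sets V_i-hat.
    stray = s * alpha
    return 1 - miss - stray


bounds = [representative_lower_bound(k, eps)
          for k in range(1, 6) for eps in (0.5, 0.1, 0.01)]
```

Since $\kappa_1 \varepsilon^2 s = \frac{4}{3}\ln(8(k+1))$, the first error term stays below $1/8$ whenever $\alpha \le 1/4$, and the second equals $1/24$, which is where the value $5/6$ comes from.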


Claim 4.11. If S is representative then the algorithm k-Cluster-Test rejects G with probability 
at least 


Proof. Let $S_i := V_i \cap S$. Since $S$ is representative, $S = \bigcup_{i=1}^{k+1} S_i$. Recall that the algorithm 
k-Cluster-Test rejects $G$ if one of the following two cases happens: 

• there is a $v \in S$ such that the $\ell_2^2$-norm estimator passes the testing of $\|p_v^\ell\|_2^2 > \sigma$; 

• for any $1 \le i < j \le k + 1$, and any vertex pair $u, v$ such that $u \in S_i$ and $v \in S_j$, $(u, v)$ is not 
an edge in the "similarity graph" (because in that case the resulting graph $H$ could not be a 
union of at most $k$ connected components). 

Ip* III > cr, then by Lemma 3.2, /|-norm tester with rejects v 


If there exists some v £ S with 


with probability at least 1 — > 2 Therefore, we assume in the following that 


for every v £ S, ||p(,||2 < o'- Let us now observe that the probability that the algorithm A:-Cluster- 
Test would reject G is lower bounded by the probability that for any 1 < i < j < k + 1, and any 
vertex pair u, v such that u £ Si and v £ Sj, /2-L)istribution-Test rejects the distributions p^, p^. 

Our definition of sets Vi, V2j • • • j 14+i and the assumption on (which implies that I < 
2K2^*e-‘^ — 2 m&Ki{ 4 >c(yj) ) ^^sure that for any l<i<j<k + l, and any vertex pair u, v such that 


u £ Si and v £ Sj, we can apply Lemma 44 to obtain ||p 1 — p^Hl > We know, by Theorem 3.1 
and our choice of 6,^,5 in that theorem, that for every such pair Vi, vj, / 2 -Distribution-Test will 
accept the distributions p ^., p\j. with probability at most b. Therefore, the probability that there 
exists some vertex pair u, v such that u £ Si, v £ Sj, 1 < i < j < k + 1 and {u, v) is selected as an 
edge in the “similarity graph” (which would mean that / 2 -Distribution-Test will accept p:^,p:^) 
is at most • 6. Therefore we can conclude that the algorithm /c- Cluster-Test rejects G with 


probability at least 1 — 


5>i. 


Now, the proof of Lemma 4.9 follows directly from Claims 4.10 and 4.11. 


□ 

□ 


We set ^ in Theorem 1.2. By our choice of s and i, we can find a constant c* = c(, 


d,k 


that depends on d and k satisfying this condition, and we then require that 4>* < 


4.2.3 Running time 

Now we analyze the running time of the algorithm k-Cluster-Test. First note that to sample from 
the distribution $p_v^\ell$ for any $v \in V$, we need to perform $r$ random walks of length $\ell$ from $v$, and the 
corresponding time is $O(\ell r)$. Note that each invocation of either distribution tester runs in time 
linear in the number of samples, that is, $r$. Since we sampled $s$ vertices, invoked the $\ell_2^2$-norm tester 
for each vertex in the sample set $S$, and invoked $\ell_2$-Distribution-Test for each vertex pair in $S$, we 


) = 0 ( 


know that the total running time of the algorithm is 0{isrl-rs+s'^^ In ^ 

This completes the proof of Theorem |L^, which follows directly from Lemmas 
our analysis of the running time given above. 


In - Inn^ 


and 4.9, and 


5 Proofs of central properties (Lemmas 4.2 through 4.5) 


In the following, we will prove Lemmas 4.2 through 4.5. Before that, we present two spectral properties of 
the eigenvalues of $(k, \phi_{in}, \phi_{out})$-clusterable graphs, which might be of independent interest. 

5.1 Spectral properties of clusterable graphs 


Before we state the spectral properties of clusterable graphs, we first observe that it will be sufficient 
for us to consider weighted $d$-regular clusterable graphs. This is true since our algorithm actually 
performs the lazy random walk on the (virtual) weighted $d$-regularized version $G_{reg}$ of the input 
$d$-bounded degree graph $G$. In addition, under our definition, for any set $S \subseteq V$, the outer 
conductance $\phi_G(S)$ and inner conductance $\phi(G[S])$ of $S$ in $G$ are the same as the outer conductance 
$\phi_{G_{reg}}(S)$ and inner conductance $\phi(G_{reg}[S])$ of $S$ in $G_{reg}$, respectively. For this reason, in the rest 
of this section, we will assume that $G$ is a weighted $d$-regular graph. 

The proofs of spectral properties of clusterable graphs rely on a recent higher-order Cheeger 
inequality by Lee et al. [LOT12]. To state the inequality, we first introduce some notation. 

Let $A$ denote the adjacency matrix of $G$. Let $\mathcal{L} = I - \frac{1}{d}A$ be the Laplacian matrix of $G$, where 
$I$ is the identity matrix. Let $\lambda_i$ be the $i$th smallest eigenvalue of the Laplacian matrix $\mathcal{L}$ and let $v_i$ 
denote the corresponding (unit) eigenvector. Note that the probability transition matrix of the lazy 
random walk on $G$ is $W := \frac{1}{2}\left(I + \frac{1}{d}A\right) = I - \frac{1}{2}\mathcal{L}$, and it is straightforward to see that $\{1 - \frac{\lambda_i}{2}\}_{1 \le i \le n}$ is the set of 
eigenvalues of $W$ with corresponding eigenvectors $\{v_i\}_{1 \le i \le n}$ (cf. Appendix A.1 for more details). 

For a $d$-regular graph $G$, let $\rho_G(k)$ denote the minimum value of the maximum conductance 
over any possible $k$ disjoint nonempty subsets. That is, 

$$\rho_G(k) := \min_{\text{disjoint } S_1, \dots, S_k} \ \max_{1 \le i \le k} \phi_G(S_i).$$ 

Lee et al. [LOT12] proved the following higher-order Cheeger inequality. 

Theorem 5.1 ([LOT12]). For any weighted $d$-regular graph $G$ and any $k \ge 2$, it holds that 

$$\frac{\lambda_k}{2} \le \rho_G(k) \le c_{5.1}\, k^2 \sqrt{\lambda_k},$$ 

where $c_{5.1}$ is some universal constant. 

Remark. Lee et al. actually proved a stronger version of the above theorem that applies to any 
weighted graph, by using a volume-based definition of conductance (see Appendix A.2). The weaker 
version given by Theorem 5.1 will be enough for our application. 


Now we are ready to state the spectral properties of clusterable graphs, which are given in 
the following two lemmas. The first lemma says that in a $k$-clusterable graph there is a large gap 
between $\lambda_h$ and $\lambda_{h+1}$ for some $h \le k$. 


Lemma 5.2. If G is weighted d-regular and {k, (pin, (pout) -clusterable, then there exists h, 1 < h < k, 


such that Xi < 2(pout for any i < h, and Aj > for any i>h + 


1 . 


Proof. Since $G$ is $(k, \phi_{in}, \phi_{out})$-clusterable, for some $h$, $1 \le h \le k$, there exists a partition of 
$V$ into $h$ sets $C_1, \dots, C_h$ such that $\phi(G[C_i]) \ge \phi_{in}$ and $\phi_G(C_i) \le \phi_{out}$ for any $i \le h$. From the 
latter, we obtain that $\rho_G(h) \le \max_i \phi_G(C_i) \le \phi_{out}$, and then by Theorem 5.1, $\lambda_h \le 2\phi_{out}$, and thus 
for any $i \le h$, $\lambda_i \le \lambda_h \le 2\phi_{out}$. 

Next, let us consider an arbitrary $(h + 1)$-partition $P_1, \dots, P_{h+1}$ of $V$. We note that there must 
be at least one set in the partition, say $P_{i_0}$, such that $|P_{i_0} \cap C_j| \le \frac{1}{2}|C_j|$ for every $1 \le j \le h$. This is 
true since otherwise, for every $i$, $1 \le i \le h + 1$, $P_i$ would contain more than half of the vertices 
of some cluster, say $C_{\pi(i)}$, that is, $|P_i \cap C_{\pi(i)}| > \frac{1}{2}|C_{\pi(i)}|$. Then, since there are $h$ clusters $C_1, \dots, C_h$, 
by the pigeonhole principle there would have to exist two indices $i$ and $j$, $1 \le i < j \le h + 1$, such 
that $\pi(i) = \pi(j)$. This would mean that each of $P_i$ and $P_j$ contains more than half of the vertices 
from the same cluster, which is a contradiction since $P_i$ and $P_j$ are disjoint. This proves the 
existence of the set $P_{i_0}$. 

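Let $P := P_{i_0}$. For every $1 \le i \le h$, let $B_i := P \cap C_i$. Since each cluster $C_i$ has large inner 
conductance, namely $\phi(G[C_i]) \ge \phi_{in}$, and since $|B_i| \le \frac{1}{2}|C_i|$, we have $e(B_i, C_i \setminus B_i) \ge \phi_{in} d |B_i|$ for 
every $1 \le i \le h$. Hence, 

$$\phi_G(P) = \frac{e(P, V \setminus P)}{d|P|} \ge \frac{\sum_{i=1}^{h} e(B_i, C_i \setminus B_i)}{d \sum_{i=1}^{h} |B_i|} \ge \phi_{in},$$ 

and thus $\rho_G(h + 1) \ge \phi_{in}$. 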


Therefore Theorem gives (pin < pci^ + 1) < c] 5 n|h^ y^Afe+i, which yields Xh+i > 


F- 


□ 


The second lemma states that in a $k$-clusterable graph, for any large cluster $C$, the average value 
of $(v_i(u) - v_i(v))^2$ over all $|C|^2$ vertex pairs $u, v \in C$ is as small as $O\left(\frac{\phi_{out}}{|C| \phi_{in}^2}\right)$, for any $i \le h \le k$. 


Lemma 5.3. Let $G = (V, E)$ be a weighted $d$-regular graph that is $(k, \phi_{in}, \phi_{out})$-clusterable and 
let $C \subseteq V$ be any subset with $\phi(G[C]) \ge \phi_{in}$. Then there is $h$, $1 \le h \le k$, such that for every $i$, 
$1 \le i \le h$, the following holds: 

$$\frac{1}{|C|} \sum_{u, v \in C} (v_i(u) - v_i(v))^2 \le \frac{8 d^4 \phi_{out}}{\phi_{in}^2}.$$ 


Proof. Since G is (A:, ^o^^j-clusterable, by Lemma [^ , there exists h, 1 < h < k, such that 

, o 

and Aj < 2(pout for any 1 < i < h. Hence, for any i < h, hy the variational principle 


A/i+i > 




of eigenvauies (see Fact A.2 in Appendix), we have 


— 7 S ^(pout 

a 


( 1 ) 


Let us recall a known result (see, e.g., [ Chu97| , (1.5), p. 5]) that for any weighted graph H = 

{Vh,Eh)B 


X 2 {H) = volniVn) ■ min 




( 2 ) 


where X 2 {H) denotes the second smallest eigenvalue of the normalized Laplacian of H, the volume 
voIh{S) of a set S' C Vh is the sum of degrees of vertices in S, that is, voIh(S') := Xt;eS dniv). 

Let us consider the induced subgraph H := G\C] on G. Let (pYf^{S) := and (/)™*(P) := 

ming: voig (S) <voig {Vh)/2 ""vfig Appendix [V^). Since (p{H) > (pin, then it is straightforward 

to see that <p''°^{H) > Cheeger’s inequality (cf. Theorem |A.3| ) yields X 2 {H) > Therefore, 
if we apply this bound to inequality then. 


yo\h{Vh) 




E 


u,vGVh 


[vi{u) - Vi{v))'^dH{u)dH{v] 


> X2{H) > 


4>ln 

2d^ 


^We remark that in [ ]Chu97| ], the summation in the denominator is over all unordered pairs of vertices, while in 
our context, the summation is over all possible |Vh|^ ve rtex pa irs. Therefore, a multiplicative factor 2 appears in the 
numerator in equation (^), compared with the form in [|Clhu97 , (1.5), p. 5]. 

^This can be verified by considering the set S with volif(S') < volir(yH)/2 such that if 

l^l < hid, then > ct>HiS) > if |5| > hid, then where the 

penultimate inequality follows from the fact that cpjffVH \ S) = d 4>in and the last inequality follows from 

that l^l < voIh(S') < yoIhIVh \S)< d\VH \ S|. 
Combining this with the fact that $\sum_{(u,v) \in E_H} (v_i(u) - v_i(v))^2 \le \sum_{(u,v) \in E} (v_i(u) - v_i(v))^2 \le 2d\phi_{out}$, 
where the last inequality follows from inequality (1), we have that 

^ ^ ^ ^ ^ / 8dHolH{VH)(pout 

} , (Vi(u) - Vi(u)) dH[u)dH{v) < - -2 - . 

u^v^Vfj 


Next, since (piH) > (pin > 0 implies that d}j{u) > 1 for any u G Vfj, and since the fact that for any 
u G Vh, duiu) < d yields voIh(Vh) < | = dlC*!, using the bound above we obtain: 

^ \ z' ^^2 / ^ ( ^^2 . / N . ! \ ^ 8d?^o\H{yH)(pout . 8d‘^\C\(l)out 

2^ (Vi(u) - Vi(u)) < 2^ {^i{u) -Wi{v)) dH{u)dH{v) < - -2 - < - -2 - ■ 

U^V^Vh U^V£Vh 


The completes the proof of Lemma 5.3. 


□ 


Remark. In Lemma |C.l| we show that Lemma ^.3| is essentially tight for k = 2 and constant (pin- 
We prove that there is a ( 2 , (pin, (/)oMt)-clusterable graph G with clusters Ci, C 2 such that for at least 
one cluster, say Ci, the average value of (v 2 (tt) — V 2 (m))^ between vertices u,v from Ci is | ). 


5.2 Proofs of Lemmas 4.2, 4.3, 4.4 


In this section, we prove Lemmas 4.2 through 4.4. For a $d$-bounded degree graph $G$, recall that $p_v^t$ is the 
probability distribution of the endpoints of the lazy random walk of length $t$ starting from $v$ on 
$G_{reg}$. Let $W_{reg}$ be the probability transition matrix of the lazy random walk on $G_{reg}$ and let $1_v$ be 
the characteristic vector of vertex $v$. Then $p_v^t = 1_v (W_{reg})^t$. 

In this section, let $\lambda_i^{reg}$ denote the $i$th smallest eigenvalue of the normalized Laplacian matrix 
of the regularized version $G_{reg}$ of $G$ and let $v_i^{reg}$ be the corresponding unit eigenvector. 

Now we prove Lemma 4.2, which shows that the $\ell_2$-norm of the difference of two random walk 
distributions $p_u^t - p_v^t$ is small for most pairs $u, v$ from the same cluster for $t$ large enough. 


Proof of Lemma \4-^ - For the d-bounded degree graph G, we apply Lemma 
d-regular version Greg. For the subset G, by defining Ac,i := 
following: 


.ecvr"(u) 


to its weighted 
, we obtain the 


E 

ueC 


(v”<(.,) - Ao,)^ = ^ E < 


4dV, 


out 


ICI 


u,vGC 


rin 


where we used the elementary identity - Yli<ji^i ~ ~ ai, • • •, an- 

Therefore, the average of — Ac^j)^ over all vertices in G is at most This 

II ^in 

implies that for at least (1 — a)|G| vertices tt G G, we have (vj®^(u) — for all i, 

1 < i < h < k. Let G 'L G denote the set of vertices with this property. 

Consider any two vertices u,v G G. We observe that for any i, 1 < i < h, we have (vj®®(rt) — 
< 2((v[®®(ri) — Ac^j)^ + — Ac^j)^) < where the first inequality that 

^1^ \^in 

(x — y)^ < 2((x — z)^ -j- (z — y)^) follows directly from the Cauchy-Schwarz inequality, and the 
second inequality follows from the property of vertices in G. Next, by Fact |A.1| we have p^ — p^ = 
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J l|2 


reg 

Vi 

\reg 

w)(i-V 


and therefore 



= 

n 

E(''r‘(«) 

-v^ 


^reg 

2 

^2t 



Elv'"!-) 

reg 

-Vi 


^reg 

2 

)2t + 

n 

E 


i=l 




i 

=h-\-l 

< 

E(vr'(«) 

reg 

-Vi 

{v)f + (1 

\reg 

^h+1 

9 1 

n 

E (+"( 


i=l 





i=h-\-l 

< 

IGhkd'^ifout 

»\C\(Pin 

+ 4(1 

(Yin 




< 

IGk'^d'^efout 

+ 4(1 

^n 







2t 


In the bound above, in the penultimate inequality we use the fact that Y17= 

-2 

1 for any u G V (by Fact ) and (by Lemma p.2| ) , and in the last inequality we 


use that \C\ > j3n. Now by defining := Q^(a, /3, d,/c) = ''^•2 letting 


t > -^e can conclude that ||p(, — p(j||| < 

in 


□ 


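To prove Lemma 4.3, we again use the eigen-decomposition of the vector $p_u^t$ as given in Fact A.1 
and the fact that all eigenvalues of the normalized Laplacian of $G_{reg}$ are large except for the first 
few ones. This allows us to bound the norm of $p_u^t$ by its projection on the first few eigenvectors. 

Proof of Lemma 4.3. For any vertex $u \in V$, let $\delta(u) := \sum_{i=1}^{k} v_i^{reg}(u)^2$. Since each eigenvector 
is of unit length, we have 

$$\sum_{u \in V} \delta(u) = \sum_{u \in V} \sum_{i=1}^{k} v_i^{reg}(u)^2 = \sum_{i=1}^{k} \sum_{u \in V} v_i^{reg}(u)^2 = k.$$ 

Therefore, the expected value of $\delta(u)$ is at most $\frac{k}{n}$, and by Markov's inequality, we know that 
for any $0 < \alpha < 1$, there exists a subset $V' \subseteq V$ such that $|V'| \ge (1 - \alpha)|V|$ and that for any $u \in V'$, 
$\delta(u) \le \frac{k}{\alpha n}$. In addition, by Fact A.1, $1_u = \sum_{i=1}^{n} v_i^{reg}(u)\, v_i^{reg}$ and $p_u^t = \sum_{i=1}^{n} v_i^{reg}(u) \left(1 - \frac{\lambda_i^{reg}}{2}\right)^t v_i^{reg}$. 
Therefore, 

$$\begin{aligned}
\|p_u^t\|_2^2 &= \Big\| \sum_{i=1}^{n} v_i^{reg}(u) \Big(1 - \frac{\lambda_i^{reg}}{2}\Big)^{t} v_i^{reg} \Big\|_2^2 = \sum_{i=1}^{n} v_i^{reg}(u)^2 \Big(1 - \frac{\lambda_i^{reg}}{2}\Big)^{2t} \\
&= \sum_{i=1}^{k} v_i^{reg}(u)^2 \Big(1 - \frac{\lambda_i^{reg}}{2}\Big)^{2t} + \sum_{i=k+1}^{n} v_i^{reg}(u)^2 \Big(1 - \frac{\lambda_i^{reg}}{2}\Big)^{2t} \\
&\le \sum_{i=1}^{k} v_i^{reg}(u)^2 + \Big(1 - \frac{\lambda_{k+1}^{reg}}{2}\Big)^{2t} \sum_{i=k+1}^{n} v_i^{reg}(u)^2 \\
&\le \delta(u) + \Big(1 - \frac{\lambda_{k+1}^{reg}}{2}\Big)^{2t} \\
&\le \frac{k}{\alpha n} + \Big(1 - \frac{\phi_{in}^2}{2 c_{5.1}^2 k^4}\Big)^{2t},
\end{aligned}$$ 

where in the last inequality, we used the fact that $\lambda_{k+1}^{reg} \ge \frac{\phi_{in}^2}{c_{5.1}^2 k^4}$ by Lemma 5.2. In particular, the 
last bound implies that if $t \ge \frac{c_{4.3} k^4 \log n}{\phi_{in}^2}$ for $c_{4.3} := c_{5.1}^2$, then $\|p_u^t\|_2^2 \le \frac{2k}{\alpha n}$. 

□ 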


Now we give the proof of Lemma 4.4, which shows that the $\ell_2$-norm of the difference of two 
random walk distributions $p_u^t - p_v^t$ is large for most pairs $u, v$ from two different clusters if $t$ is 
not too large. For any vector $p$ and vertex set $S$, let $p(S) := \sum_{v \in S} p(v)$. 


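Proof of Lemma 4.4. For any given subset $A \subseteq V$, vertex $v \in A$, and integer $t$, let $\mathrm{rem}(v, t, A)$ be 
the event that the lazy random walk of length $t$ starting at vertex $v$ never leaves $A$ in all $t$ steps. 
Let $I_A$ be the diagonal matrix such that $I_A(v, v) = 1$ if $v \in A$ and $0$ otherwise. Then the probability 
that the walk stays entirely in $A$ is $\Pr[\mathrm{rem}(v, t, A)] = (1_v (W_{reg} I_A)^t)(A)$. 
We will use the following claim. 

Claim 5.4 (Proposition 2.5 in [ST13]). For any $t \ge 1$ and any subset $A \subseteq V$ such that $\phi_G(A) \le \psi$, 
we have $\frac{1_A}{|A|} (W_{reg} I_A)^t (A) \ge 1 - \frac{t\psi}{2}$. 

Let $Q_A = \{v \in A : \Pr[\mathrm{rem}(v, t, A)] < 1 - \frac{t\psi}{2\alpha}\}$. Then, 

$$1 - \frac{1_A}{|A|} (W_{reg} I_A)^t (A) = \frac{1}{|A|} \sum_{v \in A} \big(1 - 1_v (W_{reg} I_A)^t (A)\big) \ge \frac{1}{|A|} \sum_{v \in Q_A} \big(1 - 1_v (W_{reg} I_A)^t (A)\big) > \frac{|Q_A|}{|A|} \cdot \frac{t\psi}{2\alpha}.$$ 

From Claim 5.4 and the inequality above, we conclude that $|Q_A| < \alpha|A|$. Therefore, if we set 
$\hat{A} = A \setminus Q_A$, then $|\hat{A}| \ge (1 - \alpha)|A|$, and for any $v \in \hat{A}$, $\Pr[\mathrm{rem}(v, t, A)] \ge 1 - \frac{t\psi}{2\alpha}$. This proves the 
first part of the lemma. 

To prove the second part, we continue similarly and set $Q_B = \{v \in B : \Pr[\mathrm{rem}(v, t, B)] < 1 - \frac{t\psi}{2\alpha}\}$ 
and define $\hat{B} = B \setminus Q_B$, to obtain that $|\hat{B}| \ge (1 - \alpha)|B|$, and for any $v \in \hat{B}$, $\Pr[\mathrm{rem}(v, t, B)] \ge 1 - \frac{t\psi}{2\alpha}$. 
Hence, for any $t \ge 1$ and $0 < \alpha < 1$, for any $u \in \hat{A}$ and $v \in \hat{B}$: 

$$p_u^t(A) \ge \Pr[\mathrm{rem}(u, t, A)] \ge 1 - \frac{t\psi}{2\alpha} \quad \text{and} \quad p_v^t(B) \ge \Pr[\mathrm{rem}(v, t, B)] \ge 1 - \frac{t\psi}{2\alpha}.$$ 

Since $A$ and $B$ are disjoint, we have $p_v^t(A) \le p_v^t(V \setminus B) = 1 - p_v^t(B) \le \frac{t\psi}{2\alpha}$. Therefore, for any 
$t \ge 1$, 

$$\|p_u^t - p_v^t\|_2 \ge \frac{\|p_u^t - p_v^t\|_1}{\sqrt{n}} = \frac{2 \max_{R \subseteq V} |p_u^t(R) - p_v^t(R)|}{\sqrt{n}} \ge \frac{2\big(p_u^t(A) - p_v^t(A)\big)}{\sqrt{n}} \ge \frac{2\big(1 - \frac{t\psi}{2\alpha} - \frac{t\psi}{2\alpha}\big)}{\sqrt{n}} = \frac{2\big(1 - \frac{t\psi}{\alpha}\big)}{\sqrt{n}}.$$ 

In particular, if $t \le \frac{\alpha}{2\psi}$, then $\|p_u^t - p_v^t\|_2 \ge \frac{1}{\sqrt{n}}$ and therefore $\|p_u^t - p_v^t\|_2^2 \ge \frac{1}{n}$. 

□ 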

Remark. It would be tempting to use in the above proof a somewhat stronger version of Claim 
5.4 that lower bounds the escaping probability by $\Omega(1) \cdot (1 - 3\psi/2)^t$ (see, for example, [GT12, 
Proposition 3.1]). However, in our proof we require the fraction of vertices in $A$ to be as large as 
$1 - \alpha$ for any small $\alpha > 0$, and we are not aware whether this holds in the stronger version of Claim 5.4. 
5.3 Partitioning into large sets with small cuts: Proof of Lemma 4.5 


In this section, we assume that e < 5 and we prove Lemma 4.5 that asserts that if a graph is far 


from fc-clusterable then its vertex set can be partitioned into k + 1 sets with low outer conductance. 
Let 0 < Cexp < ^ be a constant such that for d = 3 and every n, there exists a graph H with n 
vertices and maximum degree d = 3 that has (p{H) > Cexp- The proof of the next lemma follows 
the ideas from [ CS10 |, but it is adapted to edge expansion and works also for d = 3 (the analysis 
in [ CSIO requires d>4). 


Lemma 5.5. Let a < If for a graph G = {V,E) there is A C V with 1^41 < \s\V\ sueh that 
(/>(G[P\^]) > • a for some suffieiently large eonstant csl, then G is not e-far from every graph 

H with (j){H) > a. 


Proof. Let be a sufficiently large constant whose value will be determined later. Let G be a 
graph as in the lemma and let A V be an arbitrary set such that A C V with 1^41 < and 

cf{G[V\A])>c^ ■ a. We will turn G into a graph H by modifying at most edn edges of G and 
then prove that 4>{H) > a. This will conclude the proof. 

Our construction removes all edges between vertices in A and adds an expander graph with 
maximum degree 3 on A that has a constant fraction of vertices of degree 2. The degree 2 vertices 
are then connected to vertices P \ ^. In order to not violate the degree bound, we have to remove 
some edges between vertices in P \ ^4, which is done using the following construction. 

We will first construct an auxiliary set $S$ of size $\lceil |A|/4 \rceil$. Each element of set $S$ is an edge $\{u, v\}$ 
for some $u, v \in V \setminus A$ (we allow self-loops). The set $S$ can be constructed by the following algorithm. 

ConstructS(G, A) 

    Q_L = {v ∈ V \ A : d_G(v) ≤ d − 2} 
    S' = {{v, v} : v ∈ Q_L} 
    U = (V \ A) \ Q_L 
    while there is v ∈ U with at least one neighbor in U do 
        let u ∈ U be a neighbor of v 
        S' = S' ∪ {{u, v}} 
        U = U \ {u, v} 
    return set S defined as an arbitrary subset of S' of size ⌈|A|/4⌉ 
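
The procedure above can be sketched in runnable form as follows. This is an illustrative rendering rather than the authors' implementation: the adjacency-dictionary representation, the deterministic tie-breaking (smallest vertex first), and the error raised when $S'$ is too small are all choices of this sketch, and the function name `construct_s` is hypothetical.

```python
import math


def construct_s(adj, A, d):
    """Sketch of ConstructS: return ceil(|A|/4) pairs of vertices outside A.

    adj maps each vertex to the set of its neighbors in G, A is a set of
    vertices, and d is the degree bound. A pair (v, v) stands for a self-loop.
    """
    outside = set(adj) - set(A)
    # Vertices outside A with at least two free degree slots.
    q_low = {v for v in outside if len(adj[v]) <= d - 2}
    s_prime = [(v, v) for v in sorted(q_low)]
    U = outside - q_low
    # Greedily pick edges inside U and remove both endpoints from U.
    found = True
    while found:
        found = False
        for v in sorted(U):
            neighbors_in_u = (adj[v] & U) - {v}
            if neighbors_in_u:
                u = min(neighbors_in_u)
                s_prime.append((u, v))
                U -= {u, v}
                found = True
                break
    size = math.ceil(len(A) / 4)
    if len(s_prime) < size:
        raise ValueError("S' is too small; A violates the size precondition")
    return s_prime[:size]


# Toy input: a cycle on 8 vertices with A = {0}.
CYCLE8 = {i: {(i - 1) % 8, (i + 1) % 8} for i in range(8)}
```

With degree bound $d = 3$ no vertex of the cycle has two free slots, so the pair comes from the greedy edge selection; with $d = 4$ every vertex outside $A$ qualifies for $Q_L$ and the output is a self-loop pair.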


We prove that Constructs ensures that |5'| > gjEl, which implies that the last step of the 
algorithm can always be executed and we get |5| = [ 1741 / 4 ]. 

Claim 5.6. If algorithm Constructs is invoked with A that satisfies |74| < |elI7|, 0 < e < 
then the constructed set S' has size at least ||I7|. 

Proof. We first observe that at the end of the algorithm, each vertex in U has degree at least d — 1 
and all the neighbors of vertices in U belong to V \ U. This implies that the number of edges 
connecting U and E \ 17 is on one hand, at least {d — 1)|17|, and on the other hand, it is at most 
d\V \U\. Therefore, d|E \ 17| > (d — 1)|17|, and since d > 3, this yields |E \ 17| > ||17|, and thus 

Now, we observe that IS"] > ^|(E \ ^4) \ 17|, and therefore IS"] > |(|E| — — |17|) > ^(lE] — 

“ fl^l) = ^1^1 — bI^I’ every A that satisfies the prerequisites of the claim. □ 
We next describe our construction of the graph $H$. If $|A| \ge 10$, then we proceed as follows. 
We partition A into two sets A' and A", with \A!'\ = 2 • [1^1/4]. Let H' = {A!,E') be a graph 
with degree at most 3 and > Cexp, whose existence follows from our definition of Cexp- Since 

adding edges (while maintaining the degree bound) does not decrease the conductance and since 
|A| > 10, we may assume that H' has at least \A"\ edges. Let H* = {A,E*) be a graph obtained 
from H' by taking an arbitrary set of \A"\ edges from E' and replacing them by a path of length 
two, whose intermediate vertex is from A" in such a way that every vertex from A” is used exactly 
once. 

If 1 < |A| < 10 we define H* = {A,E*) to be a path and choose A” to be an arbitrary subset 
of A of size 21"^]. If |A| = 1 we define H* = {A, E*) with E* = 0, and set A' = 0 and A” = A. 

Now we will modify G by changing at most edn edges to construct graph H such that > a. 

We first remove in G all edges incident to A and then all edges that connect the sets s G 5 in G 
(i.e., we remove from E all edges {u,v) with u,v G s). Then we add an arbitrary perfect matching 
between the vertices in A" and S (if a vertex appears twice in s G jS then it will be matched to two 
vertices of A"; if = 1, then the vertex v from A!' will be match to both vertices from s a S. If, 
in this case, s = {u^u) we only add the edge {u,v)). Finally, we add all edges E* from the graph 
H* defined above. 

Our construction creates a new graph H from G by making at most {d + 1)|^| edge deletions 
and 3|A| edge insertions. Hence, we modified at most (d + 4)|H| < sd\V\ edges, as required. 

Next we prove that > a. We begin with two auxiliary claims about construction of H. 

Claim 5.7. Let X CV be an arbitrary set of size at most \\V\. Then the following holds: 

eniX, V\X)> 4cexp • min{|X 0 HI, |H \ X|} . 
Proof. If $|A| = 1$ the claim trivially holds for every set $X$. Thus, we can assume $|A| \ge 2$. Let $X$ 
be a subset of V of size at most If |H| < 10, we get eniX, \ X) > eniX H H, H \ X) > 

^ • min{|X n H|, |H \ X|}, since either the minimum is 0 or there is at least one edge connecting 
the two sets. Since Cexp < this implies the claim. 

Now we consider the case $|A| \ge 10$. Consider an arbitrary set $Y \subseteq A$ with $|Y| \le \frac{1}{2}|A|$. Let
$Y' = Y \cap A'$ and $Y'' = Y \cap A''$. Let us first focus on the construction of graph $H^*$ (which is a
subgraph of $H$). Let $Y^* \subseteq Y''$ be the set of vertices from $Y''$ with both of their neighbors (in $H^*$)
in $Y$ (and hence, in fact, in $Y' \subseteq A'$).

We consider two cases. If \Y" \ T*] > ^\Y\ then since each vertex in Y" \ Y* is adjacent in H* 
to at least one vertex not in Y, we obtain eH*{Y,A \ Y) > \Y'’ \ y*| > \\Y\. 

Otherwise we have \Y'' \ y*| < ^iTj, and thus \Y'\ + |y*| > ^\Y\. Since each vertex in Y' has 
degree at most 3 in H* and each vertex in Y* is adjacent in H* to exactly two vertices from Y', 
we have |y*| < ||y^|. Hence, if we combine the bounds |y'| + |y*| > ^\Y\ and |y*| < then 

we obtain \Y'\ > ^|y|. Now we make another case distinction. 

If |y'| < then |H' \ y'| > ^|H'| > ^|y^|. Note that in our construction of H* from H', 

if an edge {u, v) with u G H' \ T' and u G T' is replaced by a path of length 2 with intermediate 
vertex w G H", then at least one of the edges {u,w) and {v,w) lies between Y and H \ T in H*. 
Therefore, 

eH^{Y,A\Y) > eH'iY',A\Y') > 3ce..pmin{|y'|, [H'\ y'|} > lcexp\Y'\ > ^Cexp\Y\ . 

Otherwise, |y'| > A \A'\. In our construction we replace 2- [|H|/4] edges of H' by paths of length 
2. Since \A' \ Y'\ < and since H' has maximum degree 3, there are at most ^|H| 

edges with both endpoints in A'\Y' that are replaced. Therefore, there are 2[|H|/4] — ^|H| > ^\A\ 
edges replaced that in H' are incident to a vertex from Y'. Thus, in H* there are at least edges
leaving Y'. Since |y'| > and \A'\ > ||^|, we have |y'| > ^|^|- Therefore our assumption 

that \Y\ < ^\A\ yields |T\y'| < -^1^1. This gives us eH*{Y,A\Y) > ^\A\ - ^\A\ = > 

l\A\ > l\Y\ > j^Cexp|T|. 

Therefore, we get eH*{Y, A\Y) > ^Cexp • |T| for the case that \Y" \ y*| < ^\Y\. 

If we combine the bounds for these two cases together, then we obtain that for any Y Y A with 
|I^| < ^l^l) we have eH*{Y,A \ Y) > min{^|y|, ^Cexp|y|} = ^^CexplTl- This further implies that 
for any Y Y A, eH*{Y,A\ Y) > ^^Cexp min{|yl, \A \ y|}. 

Now we will extend the analysis to the graph H. We have eniX, V \ X) > eu* (X n A, A\X) > 
Xc,^pmm{\XnA\,\A\X\}. □ 

Claim 5.8. Let X Y V be an arbitrary set of size at most ^\V\, A Y V with |^| < \s\V\ and 
e < ^. Then the following holds: 


eHix,v\x) > |•cO•d•a•|(y\^)ny| 


— min{|y n A], |j 4 \ y|} . 


Proof. For simplicity of notation, let us define B = V \ A. Using the assumption |^| < ^e|U| and 
e < ^, we obtain \B\ > (1 — |e)|U| > Therefore, since \B Y X\ < |X| < ^\V\, we obtain 

\Br\X\ < ^ • \B\, and hence |i?\X| = \B\ — \Br\X\ > ■ \B\, what yields min{|i?ny|, |-B\X|} > 

||i? n X\. Next, by the assumption about set A in Lemma |5.5| , we know that (j){G[B]) > • a. 

Therefore, eQ^^]{B D X,B\ X) > min{|i? n Xj, |i3 \ X|} > |cUad|-B n X|. 

The only edges that are removed from $G[B]$ in order to obtain $H$ are the edges between vertices
$u, v$ with $u,v \in s$ for all $s \in S$. Consider such an edge $(u, v)$ with $u,v \in s$, $s \in S$. Since we are
analysing the size of the cut between $B \cap X$ and $B \setminus X$, we only consider $u \in B \cap X$ and $v \in B \setminus X$.
By our construction of $H$, both $u$ and $v$ are connected in $H$ to vertices in $A$. If $u$ is connected to a
vertex in $A \setminus X$ or $v$ to a vertex in $A \cap X$, then we get a new cut edge between $B \cap X$ and $B \setminus X$,
and thus this will compensate for the removal of edge $(u,v)$ from $G[B]$. Therefore, we decrease the
number of edges in the cut between $B \cap X$ and $B \setminus X$ only if $u$ is connected to a vertex in $A \cap X$
and $v$ is connected to a vertex in $A \setminus X$. Each vertex in $A$ is adjacent in $H$ to at most one vertex
from outside $A$, and therefore the number of such edges is bounded by $\min\{|X \cap A|, |A \setminus X|\}$.

If we summarize this, we obtain eniX, V \ X) > {BYX,B\ X) — min{|X n X|, |X \ X|} > 
ld^ad\B n X| - min{|X n A|, |^ \ X|} > n X| - min{|X n ^1, \ X|}. □ 

With Claims 5.7 and 5.8 at hand, we are ready to conclude the proof of Lemma 5.5. Take an
arbitrary set $X \subseteq V$ of size at most $\frac{1}{2}|V|$. We will prove that $e_H(X, V \setminus X) \ge \alpha d|X|$, which would
immediately imply that $\phi(H) \ge \alpha$.

If min{|X n ^1,1^ \ X|} > • |X|, then Claim 

Otherwise, we have min{|X n A\, \A \ X|} < . |x| < ^ • |X| for our choice of a. If the 

minimum is attained by |X n d.|, then we have |(U\X)nX| > ^-IXI- Thus Claim implies 
that assuming that c |^ > we have eniX, V \ X) > | • d • • a ■ > «d • |X|. 

If the minimum is attained by \A \ X\ we consider two cases. If |(U \ X) n X| < then 

1^1 < |A| + |(U\^)nX| < ^\A\. In this case, |Xn^| = |^\(^\X)| = |^|-|X\X| > |X|-|X|/10 > 
^1^1 > ^\A\. Since \A"\ > ^\A\ we obtain that |X n A"\ > |X n X| — \A'\ > ||X| — ^\A\ > ^|^|. 
By construction of H each vertex in A" is connected to a vertex in U \ ^4 and each vertex in U \ ^4 
is connected to at most 2 vertices in X". Since |(y\^)nX| < ^|X| there are at most ||X| vertices 
of (U \ ^) n X connected to vertices from X n A”. Hence, for our choice of a there are at least 
^1^1 - il^l > > ^1^1 ^ “^1^1 edges leaving X. 


5.7 gives that eH{X,V \ X) > ad|X|. 
If |Cl/\A)nX| > ^1^1, then |(y\^)nx| > and thus |(y\^)nX| > ^\X\. Now


if 


> Claim 1^ gives that eniX, C \ X) > | • d • cO • a • -w — ^^^\X\ > a;d|X|. Therefore, 


1^1 


15-ad I 


Lemma ^.51 follows with 


350 

Cexp 


□ 


Lemma 5.5 can be applied to construct a large set A with a small cut, as in the following lemma. 


Lemma 5.9. Let 0 < a < and 0 < e < ^. If G = {V,E) is e-far from any graph H 
with 4>{H) > a, then there is a subset of vertices A E V with i^e|C| < 1^41 < ^\V\ such that 
f>G{A) < (Q-a, for some sufficiently large constant c^. In particular, e{A,V\A) < ■a-d-\A\. 

Proof. Lemma |5.5| ensures that if G is e-far from any graph H with (t>{II) > a, then for all A' CV 
with \A'\ < |e|y| we have (^(G[IL \ A'\) < cQ • a. In particular, in our case, this will mean that 
there is a set S C C \ ^' with \B\ < ^\V \ ^^such that e{B, (V \ (A' U B)) < ■ a ■ d ■ \B\. 

We will now repeatedly apply Lemma p.5| to construct a large set A satisfying the requirements of 
LemmaLet Ai = 0. We apply Lemma p.5| with A' = Ai to obtain a set A 2 with \A 2 \ < 
and (?!>g[V\A'](^ 2 ) < c^- a. If \Ai U A 2 I > \s\V\ then we are done. Otherwise, we set A' = AiU A 2 
and repeat this process. We continue this process until for the first time, we obtain a set Ai such 
that 1^1 U • • • U Ai\ > ge|C|. In that moment, if \Ai\ > \Ai U • • • U Ai-i\ then we set A = Ai and 
otherwise, we put A = AiU ■ ■ ■ U Ai. 

Our construction ensures that since 1^41 U • • • U Ai\ > then we have 1^41 > ^e|C|. The 

upper bound on the size of A follows since \Ai\ < \\V\ and \Ai U • • • U Ai_i| < ^e|C|. 

Our construction ensures that for every 1 < j < i, e{Aj, V \ (^1 U • • • U Aj)) < ■ a ■ d - \ Aj\. 

Therefore, since we have e{Ai U • • • U IL \ (^1 U • • • U Aj)) < V \ (^1 U • • • U Ag)), 

we conclude that e{Ai U • • • U Aj, V \ (^1 U • • • U Aj)) < ■ a ■ d ■ \Ai U • • • U Hence, if 

A = AiU ■ ■ - VI Ai then we obtain e{A, H \ H) < ■ a ■ d ■ |H|, and \i A = Ai then we obtain 


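$$
\begin{aligned}
e(A, V \setminus A) &= e(A_i, A_1 \cup \cdots \cup A_{i-1}) + e(A_i, V \setminus (A_1 \cup \cdots \cup A_i)) \\
&\le e(A_1 \cup \cdots \cup A_{i-1}, V \setminus (A_1 \cup \cdots \cup A_{i-1})) + e(A_i, V \setminus (A_1 \cup \cdots \cup A_i)) \\
&\le c_{5.5} \cdot \alpha \cdot d \cdot |A_1 \cup \cdots \cup A_{i-1}| + c_{5.5} \cdot \alpha \cdot d \cdot |A_i| \\
&\le 2 c_{5.5} \cdot \alpha \cdot d \cdot |A| ,
\end{aligned}
$$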


where in the last inequality we use the fact that $|A| = |A_i| \ge |A_1 \cup \cdots \cup A_{i-1}|$.

This completes the proof by setting — 2c^. □ 
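
The iterative construction used in this proof can be sketched in code. The following Python fragment is only an illustration under stated assumptions: `find_sparse_piece` is a hypothetical oracle standing in for one application of Lemma 5.5 (it is not something the paper defines), and `target` stands for the size threshold on the union at which the process stops.

```python
def grow_sparse_set(find_sparse_piece, target):
    """Peel off pieces until their union has at least `target` vertices.

    find_sparse_piece(removed) is a hypothetical oracle: given the vertices
    removed so far, it returns a nonempty set of fresh vertices (the role
    played by one application of Lemma 5.5 in the proof).
    """
    union = set()      # A_1 ∪ ... ∪ A_i after each round
    previous = set()   # A_1 ∪ ... ∪ A_{i-1}
    piece = set()      # the most recent piece A_i
    while len(union) < target:
        piece = set(find_sparse_piece(frozenset(union))) - union
        if not piece:
            raise ValueError("oracle returned no new vertices")
        previous = set(union)
        union |= piece
    # Final rule from the proof: keep the last piece alone if it is at least
    # as large as everything collected before it, otherwise keep the union.
    return piece if len(piece) >= len(previous) else union


def next_two(removed):
    """Toy oracle (assumption for the demo): the two smallest unused ids."""
    fresh = [v for v in range(100) if v not in removed]
    return set(fresh[:2])


result = grow_sparse_set(next_two, target=5)
```

With the toy oracle, three rounds are needed and the last piece is smaller than what was collected before it, so the union of all pieces is returned.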

Let us extend the notation $e(U_1, U_2)$ to multiple sets: for disjoint subsets $V_1, \ldots, V_h$, let us
define $e(V_1, \ldots, V_h) = \sum_{1 \le i < j \le h} e(V_i, V_j)$.
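
As a small illustration of this definition, the sketch below counts the edges running between distinct parts of a partition. The function names and the toy graph are our own choices for the example, not notation from the paper.

```python
from itertools import combinations


def edges_between(edges, part_a, part_b):
    """e(U_1, U_2): edges with one endpoint in part_a and the other in part_b."""
    a, b = set(part_a), set(part_b)
    return sum(1 for u, v in edges
               if (u in a and v in b) or (u in b and v in a))


def multi_cut(edges, parts):
    """e(V_1, ..., V_h): sum of e(V_i, V_j) over all pairs i < j."""
    return sum(edges_between(edges, p, q) for p, q in combinations(parts, 2))


# Toy graph: a 6-cycle split into three consecutive pairs of vertices.
cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
parts = [{0, 1}, {2, 3}, {4, 5}]
crossing = multi_cut(cycle_edges, parts)
```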

Lemma 5.10. Let G = {V,E) be s-far from {k,(j)*j,^,(j)*.,,^^)-clusterable and < Cexp/d. If there is 
a partition of V into h sets Vi,... ,Vh with 1 < h < k, such that e(Vi,..., Vh) = 0, then there is an 
index i, I < i < h, with \Vi\ > ^ • e\V\ such that G[I^] is ^-far from any H on vertex set Vi with 
maximum degree d and (p{II) > 


Proof. Let us renumber the indices of sets $V_1, \ldots, V_h$ such that $|V_i| \ge |V_{i+1}|$ for every $i$, $1 \le i < h$.
A set Vi with more than vertices is called large and otherwise it is called small. Let s be 

the largest index such that Vg is large. (Simple counting arguments implies that we must have 
I hi I > (for otherwise we would have | Vi 1 < -^ for every i, 1 < i < I, and thus 1^* l<|fo|, 

which is a contradiction to the fact that Vi,... ,Vh is a partition of V), and hence Vi is large and 
s is well-dehned.) Next, let us observe that ^i<i<h-Vi is small \Vi\ < ^ks\V\ = |e|fo|. This follows 
from h < k and from the fact that for a small set Vj we have \ Vi\ < ^e|fo|. 
Let us construct from $G$ a new graph $G^*$ of maximum degree at most $d$ as follows. Define
$U = \bigcup_{i : V_i \text{ is small}} V_i = \bigcup_{i=s+1}^{h} V_i$, and remove in $G$ all edges incident to any vertex in $U$. Then build
a degree 3 $c_{exp}$-expander on $U$ and add it to the graph. Note that with respect to $d$, this expander
is a $\frac{c_{exp}}{d}$-expander. Call the obtained graph $G^*$.

Observe that $G^*$ has been obtained from $G$ by deleting/inserting at most $d|U| + 3|U|$ edges,
where the first term corresponds to the removal of all edges incident to $U$ and the second term
corresponds to building the degree 3 $c_{exp}$-expander on $U$.

Now, since \U\ = Y^i<i<h Vi is small 1^*1 — have shown above, we note that G* is 

obtained from G by adding/deleting at most 2-|e|’L| < ^e\V\ edges. Hence, since G* has maximum 
degree at most d, G* is ^e-far from (A;, (/)*„J-clusterable. 

Observe the structure of $G^*$: it consists of a $\frac{c_{exp}}{d}$-expander on $U$ and $s$ disjoint components (not
necessarily connected) on vertex sets $V_i$ with each $V_i$ being a large set and $G^*[V_i] = G[V_i]$; further,
$\phi_{G^*}(U) = \phi_{G^*}(V_1) = \cdots = \phi_{G^*}(V_s) = 0$.

For every $i$, $1 \le i \le s$, let us define $H_i$ to be the graph on vertex set $V_i$ with maximum
degree at most $d$, with $\phi(H_i) \ge \phi^*_{in}$ and that is obtained from $G^*[V_i]$ by the minimum number of
additions/deletions of edges; let $\kappa_i$ be the number of additions/deletions of edges needed to
transform $G^*[V_i]$ into $H_i$.

Let us observe that the graph $H$ on $V$ obtained as the union of $G^*[U]$ and $H_1, \ldots, H_s$ is
$(k, \phi^*_{in}, \phi^*_{out})$-clusterable. Indeed, since we have $H[U] = G^*[U]$, $H[V_i] = H_i$ for every $i$, $1 \le i \le s$,
and $\phi_H(U) = \phi_H(V_1) = \cdots = \phi_H(V_s) = 0$, for the partition of $V$ into $U, V_1, \ldots, V_s$, we obtain that
$\phi(H[U]) \ge c_{exp}/d \ge \phi^*_{in}$, $\phi(H[V_i]) \ge \phi^*_{in}$ for every $i$, $1 \le i \le s$, and $\phi_H(U) = \phi_H(V_1) = \cdots = \phi_H(V_s) = 0 \le \phi^*_{out}$.

We now note that H is obtained from G* by adding Yli=i edges. Therefore, since G* is ^e-far 
from -clusterable, since H is {k, (/>*^i)-clusterable, we must have > ^^d\V\, 

and thus Y^i^i^i > 1^1- Therefore, there must be at least one j, 1 < j < s, with 

Kj > ^ed\Vj\. In that case, for such a j, by the definition of Hj, G*\Vj\ = G[V^] must be |e-far 
from any graph Q on vertex set Vj with (j){Q) > (any such a graph Q must be obtained from 
G*[Vi] by at least Kj > ied|V^| addition/deletion of the edges), as required. □ 

We are now ready to prove Lemma We will set = min{|g^, thus we have 

«« Safa. ^ 

Our proof is by induction; we will construct a sequence of partitions $\{V_1\}, \{V_1, V_2\}, \ldots, \{V_1, \ldots, V_{k+1}\}$
of $V$ such that each partition $\{V_1, \ldots, V_h\}$ satisfies the following properties:

2 

(a) \Vi\ > |H| for every i, I < i < h, and 


(b) e(Hi,..., H/,) < (h - 1) • dgl • • d • |H|. 


Our first partition is the trivial partition $\{V\}$, which clearly satisfies our properties. We then
apply inductively Lemma 5.10. Let us consider some partition $\{V_1, \ldots, V_h\}$ with $1 \le h \le k$ and
assume that this partition satisfies (a) and (b). We will show how to refine it to obtain a partition
$\{V_1, \ldots, V_{h+1}\}$ satisfying properties (a) and (b).

Let us first remove from G all edges between pairs of all distinct sets H and Vj, 1 < i < j < h, 
to obtain a graph G'. Since (/>}„ < 2^1 J , we have removed e(Hi ,... ,Vh) < (/i — 1) • • (/>*„ • d • | H| < 
|ed|H| edges from G, and therefore is e/2-far from (fe, (/)*„j)-clusterable and such that our 
partition satisfies the prerequisites of Lemma |5.10 . 

■ |H| and 


Then, by Lemma 5.10, there is a set H* with 1 < f* < h, such that |H*| > 


— Sk 


G'\yi»] = G[H*] is |-far from any H on vertex set H* with maximum degree d and (/)(dd) > 
Next, we apply Lemma 5.9 on Hj* to obtain a set H C Hj* with ^ • |Vj*l < |*4| < ^|H*| such 
that e{A, Vi* \ A) < d^(j)*^d\Vi* \ < This gives us our new partition {Vi ,... ,A,Vi* \

A,..., Vh}. 

Using the bound for the size of Vi *, we have 1^41 > ^ • lU* | > 1152 k ' 1^1 l^i* \^l > ^1^* I ^ 
^ • |U|, and therefore by the induction hypothesis, our new partition satisfies (§). 

In order to prove (^), we observe the following 




.,Vh) < e{Vi,...,Vi 

< (^ — 1) • c|^ 

= h-d^-<ji*^-d-\V\ 


...,Vh) + e(A, Vi* \ A) 


1^1 


where the second inequality follows from our induction hypothesis and the bound above. 

In summary, we have proven by induction the existence of a partition of $V$ into $k + 1$ sets
$V_1, \ldots, V_{k+1}$ such that properties (a) and (b) are satisfied. Note that since property (b) implies
that for every i, 1 < i < k + 1, e(Vi, V \ Vi) < k ■ • (j)*^ ■ d - |Ul, we have 


MVi) 


eiV,V\V} ^ k-c^-^*^-d-\V\ 
d\Vi\ - d-\Vi\ 


1152fc 


1152 -k"^ -dTi 


Therefore, Lemma 4.5 follows by setting cO = 1152 • k'^ 


C5.S- 


□ 


6 Conclusion 


We presented the first study of testing the clusterability of a graph in the bounded degree model, 
where we used both the inner conductance and outer conductance of a set to measure the quality 
of a cluster [OGT14]. Our main result is an asymptotically optimal (up to polylogarithmic factors)
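algorithm with running time $O(\sqrt{n} \cdot \mathrm{poly}(\phi, k, 1/\varepsilon))$ to test if a graph is $(k, \phi)$-clusterable or is $\varepsilon$-far
from $(k, \phi^*)$-clusterable for $\phi^* = c'_{d,k} \frac{\phi^2 \varepsilon^4}{\log n}$. Our tester uses new ideas of testing pairwise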

closeness of distributions of random walks starting from a pair of sampled vertices and draws
conclusions about the graph structure from them. One of the key techniques underlying our analysis is a
new application of the recent results on higher-order Cheeger inequalities [LOT12].

For further research, one of the major open problems is to narrow the gap between $\phi$ and $\phi^*$,
or to prove that the current gap is almost optimal for any tester with similar running time. As we
discussed in Section [L^, fundamentally new ideas are needed here. 

It would also be very interesting to gain deeper insights into the structure of graphs that are $\varepsilon$-far
from $(k, \phi^*)$-clusterable, that is, to improve Lemma 4.5. More specifically, is it possible to get rid
of the dependency on $\varepsilon$ of the upper bounds for inner and/or outer conductance in Lemma 4.5?
Appendix


A Useful tools from spectral graph theory 

In this section, we introduce some useful tools from spectral graph theory that will be used in our 
analysis. 


A.1 Elementary facts from spectral graph theory

Let $G = (V, E)$ be a weighted $d$-regular graph. Recall that we let $A$, $W = \frac{I + \frac{1}{d}A}{2}$, and $\mathcal{L} = I - \frac{1}{d}A$
denote the adjacency matrix, the lazy random walk matrix and the (normalized) Laplacian matrix of
$G$, respectively.

Let $\mathbf{1}_S$ denote the indicator vector of a subset $S \subseteq V$, that is, $\mathbf{1}_S(u) = 1$ if $u \in S$ and
$\mathbf{1}_S(u) = 0$ if $u \notin S$. We let $\mathbf{1}_u = \mathbf{1}_{\{u\}}$. For a vector $p$, let $p^T$ denote its transpose and let
$p(S) := \sum_{v \in S} p(v)$. It is useful to notice that for any probability distribution $p$ on $V$, $p(W)^t$ is
the probability distribution of the endpoint of a length $t$ random walk with initial distribution $p$.
In particular, we let $p_u^t := \mathbf{1}_u (W)^t$.
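
To make the role of $W$ concrete, here is a minimal Python sketch that computes the distribution of a lazy random walk by repeated multiplication with $W = (I + A/d)/2$. It assumes an unweighted $d$-regular graph given as adjacency lists; the function names and the 4-cycle example are illustrative choices, not part of the paper.

```python
def lazy_walk_step(p, adj, d):
    """One step of the lazy walk: returns the row vector p * W."""
    q = [0.5 * mass for mass in p]        # stay put with probability 1/2
    for u, mass in enumerate(p):
        for v in adj[u]:
            q[v] += 0.5 * mass / d        # move along each edge w.p. 1/(2d)
    return q


def walk_distribution(adj, d, start, t):
    """Distribution of the endpoint of a length-t lazy walk from `start`."""
    p = [0.0] * len(adj)
    p[start] = 1.0                        # indicator vector of the start vertex
    for _ in range(t):
        p = lazy_walk_step(p, adj, d)
    return p


# Toy example: the 4-cycle, which is 2-regular.
cycle4 = [[1, 3], [0, 2], [1, 3], [2, 0]]
after_one_step = walk_distribution(cycle4, 2, start=0, t=1)
after_many_steps = walk_distribution(cycle4, 2, start=0, t=50)
```

On a connected regular graph the lazy walk converges to the uniform distribution, which the second call illustrates numerically.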

Let $0 = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n \le 2$ be the eigenvalues of $\mathcal{L}$ and let $v_1, v_2, \ldots, v_n$ be the corresponding
orthonormal left eigenvectors [Chu97]. Let $\eta_1 \ge \eta_2 \ge \cdots \ge \eta_n$ denote the eigenvalues of $W$,
then it is easy to see that for each $i \le n$, $\eta_i = 1 - \frac{\lambda_i}{2}$ and $v_i$ is the corresponding eigenvector, where
$\lambda_i$ and $v_i$ are the $i$th eigenvalue and eigenvector of $\mathcal{L}$, respectively. Therefore, all the eigenvalues of
$W$ are non-negative and no larger than 1. Note that since $\mathcal{L}$ (or $W$) is symmetric, its eigenvectors
form an orthonormal basis of the Euclidean space $\mathbb{R}^n$. By the eigendecomposition of
$W$, we have $W = \sum_{i=1}^n \eta_i v_i^T v_i = \sum_{i=1}^n \left(1 - \frac{\lambda_i}{2}\right) v_i^T v_i$.

We have the following basic fact. 

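Fact A.1. For any vertex $u$ and $t \ge 1$, we have

1. $\mathbf{1}_u = \sum_{i=1}^n v_i(u) v_i$,

2. $\sum_{i=1}^n v_i(u)^2 = 1$,

3. $p_u^t = \sum_{i=1}^n v_i(u) \left(1 - \frac{\lambda_i}{2}\right)^t v_i$.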

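Proof. Since $v_1, \ldots, v_n$ form an orthonormal basis of $\mathbb{R}^n$, we can represent $\mathbf{1}_u$ in terms of this
basis, say $\mathbf{1}_u = \sum_{i=1}^n \alpha_i v_i$, where $\alpha_i \in \mathbb{R}$ for each $1 \le i \le n$. By taking the inner product with $v_i$ on
both sides, we can solve to get $\alpha_i = \langle \mathbf{1}_u, v_i \rangle = v_i(u)$, for any $i \le n$. Furthermore,
$1 = \|\mathbf{1}_u\|_2^2 = \sum_{i=1}^n \alpha_i^2 = \sum_{i=1}^n v_i(u)^2$ and
$p_u^t = \mathbf{1}_u W^t = \left(\sum_{i=1}^n \alpha_i v_i\right)\left(\sum_{i=1}^n \left(1 - \frac{\lambda_i}{2}\right)^t v_i^T v_i\right) = \sum_{i=1}^n v_i(u)\left(1 - \frac{\lambda_i}{2}\right)^t v_i$.

This completes the proof of the fact. □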

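We also need the following simple fact about the eigenvalue $\lambda_i$ and eigenvector $v_i$ of the Laplacian
$\mathcal{L}$, which is known as the Rayleigh quotient formulation of $\lambda_i$ [Chu97].

Fact A.2. For any $1 \le i \le n$,

$$\lambda_i = \frac{v_i \mathcal{L} v_i^T}{v_i v_i^T} = \frac{v_i (dI - A) v_i^T}{d\, v_i v_i^T} = \frac{\sum_{(u,v) \in E} (v_i(u) - v_i(v))^2}{d \sum_u v_i(u)^2} = \frac{\sum_{(u,v) \in E} (v_i(u) - v_i(v))^2}{d} .$$

Proof. By definition, $v_i \mathcal{L} = \lambda_i v_i$. Multiplying by $v_i^T$ on both sides, we have $v_i \mathcal{L} v_i^T = \lambda_i v_i v_i^T$, which
gives that $\lambda_i = \frac{v_i \mathcal{L} v_i^T}{v_i v_i^T} = \frac{v_i (dI - A) v_i^T}{d\, v_i v_i^T}$.

Now noting that for any vector $x$, $x(dI)x^T = d \sum_u x(u)^2 = \sum_{(u,v) \in E} (x(u)^2 + x(v)^2)$, and
$x A x^T = 2 \sum_{(u,v) \in E} x(u) x(v)$, we have $x(dI - A)x^T = \sum_{(u,v) \in E} (x(u)^2 + x(v)^2 - 2x(u)x(v)) = \sum_{(u,v) \in E} (x(u) - x(v))^2$. Therefore,
$\lambda_i = \frac{\sum_{(u,v) \in E} (v_i(u) - v_i(v))^2}{d \sum_u v_i(u)^2} = \frac{\sum_{(u,v) \in E} (v_i(u) - v_i(v))^2}{d}$,
where the last equation follows from the fact that $v_i$ is a unit-length vector
for any $1 \le i \le n$.

□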
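
The key identity in this proof, that the quadratic form of $dI - A$ equals the sum of squared differences over the edges, can be checked numerically. The sketch below does so for an unweighted $d$-regular graph given as a list of undirected edges; the function names and the test vector are assumptions made for the example.

```python
def laplacian_quadratic_form(edges, d, x):
    """x (dI - A) x^T, computed directly from the matrix definition."""
    total = d * sum(val * val for val in x)
    for u, v in edges:
        total -= 2 * x[u] * x[v]   # A has entries at both (u, v) and (v, u)
    return total


def edge_difference_sum(edges, x):
    """Sum over undirected edges (u, v) of (x(u) - x(v))^2."""
    return sum((x[u] - x[v]) ** 2 for u, v in edges)


# Toy example: the 4-cycle (2-regular) with an arbitrary integer vector.
cycle4_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
x = [1, -2, 3, 5]
lhs = laplacian_quadratic_form(cycle4_edges, 2, x)
rhs = edge_difference_sum(cycle4_edges, x)
```

The identity relies on every vertex having degree exactly $d$, so the check would fail on a graph that is not $d$-regular.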


A.2 Volume-based definition of conductance and Cheeger's inequality

In this section, we introduce the volume-based definition of conductance that has been used frequently
in the literature before (cf. [LOT12] and the references therein). In this section, we consider
an arbitrary undirected and weighted graph $G = (V, E, w)$.

Let $w(v) := \sum_{(u,v) \in E} w(u,v)$ be the weighted degree of vertex $v$. For a vertex set $S \subseteq V$, let
$w(S) := \sum_{v \in S} w(v)$ be the sum of weighted degrees of vertices in $S$. We will refer to $w(S)$ as the
volume of set $S$. For $S, T \subseteq V$, let $w(S,T) := \sum_{(u,v) \in E,\, u \in S,\, v \in T} w(u,v)$ be the sum of weights of
edges with one endpoint in $S$ and the other endpoint in $T$. The volume-based conductance of $S$ in
$G$ is defined as

$$\phi^{vol}(S) := \frac{w(S, V \setminus S)}{w(S)} .$$

Let $\phi^{vol}(G) := \min_{S : w(S) \le w(V)/2} \phi^{vol}(S)$. Note that generally, for a $d$-bounded degree graph
$G$, the definition of conductance of a set $S$ we are using in the paper is slightly different from
the volume-based definition of conductance of $S$ given above. However, in a weighted $d$-regular
graph $G = (V, E)$, these two definitions are identical.
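
The following Python sketch compares the two notions on small examples. It assumes that the degree-based conductance used in the main part of the paper is $e(S, V \setminus S)/(d|S|)$; the function names and the toy graphs are our own.

```python
def volume_conductance(weighted_edges, subset):
    """Volume-based conductance w(S, V \\ S) / w(S) for edges (u, v, weight)."""
    s = set(subset)
    cut = 0.0
    volume = 0.0
    for u, v, w in weighted_edges:
        if (u in s) != (v in s):
            cut += w                 # edge crosses the cut
        if u in s:
            volume += w              # contributes to weighted degree of u
        if v in s:
            volume += w              # contributes to weighted degree of v
    return cut / volume


def degree_conductance(weighted_edges, subset, d):
    """Degree-bound-based conductance e(S, V \\ S) / (d |S|)."""
    s = set(subset)
    cut = sum(w for u, v, w in weighted_edges if (u in s) != (v in s))
    return cut / (d * len(s))


# On a 2-regular graph (the 6-cycle) the two values coincide.
cycle6 = [(i, (i + 1) % 6, 1.0) for i in range(6)]
vol_cycle = volume_conductance(cycle6, {0, 1, 2})
deg_cycle = degree_conductance(cycle6, {0, 1, 2}, 2)

# On a path with degree bound d = 2 they differ for an endpoint.
path3 = [(0, 1, 1.0), (1, 2, 1.0)]
vol_path = volume_conductance(path3, {0})
deg_path = degree_conductance(path3, {0}, 2)
```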

We let $A$ be the adjacency matrix of the weighted graph $G$, and let $D$ denote the diagonal
matrix with $D(v,v) = w(v)$. Let $\mathcal{L} = I - D^{-1/2} A D^{-1/2}$ denote the normalized Laplacian of $G$,
and let $\lambda_i$ denote the $i$th smallest eigenvalue of $\mathcal{L}$. Cheeger's inequality gives the following.

Theorem A.3 ([AM85, Alo86, SJ89]). For any undirected and weighted graph $G$, it holds that

$$\frac{\lambda_2}{2} \le \phi^{vol}(G) \le \sqrt{2\lambda_2} .$$

B On distribution testers: Proof of Lemma 3.2


For the sake of completeness, we give here a proof of Lemma 3.2. 


The description of the algorithm tester for testing if Hp^Hl < o'/d 


Proof of Lemma [ 
or 11 III > a is very simple: 

1 . let Zy denote the number of pairwise self-collisions of the r samples from p*; 

2 . reject if and only if Zy > ^( 2 )^- 

The performance of the above algorithm is guaranteed by the first paragraph of the proof of 
Lemma 4.2 in ||CS1CI|] (that in turn is built on Lemma 1 in [|GROO|] ) by setting e = ^ there. It 

is proven that if r > 16y/n, with probability at least 1 — ^(DIIp^HI ^ < i( 2 )IIPwll 2 ' 

Therefore, with probability at least 1 — , if Hp^HI < fj then Zy < |(|)f < and the 

□ 


tester will accept; and if ||p^||| > u, then Zy > and the test will reject. 
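
The collision-counting idea behind this tester can be sketched as follows. This is a minimal illustration, not the exact tester of Lemma 3.2: the acceptance threshold constant is not recoverable from the text above, so it appears here as a hypothetical parameter `slack`, and all function names are our own.

```python
from collections import Counter
from math import comb


def count_self_collisions(samples):
    """Number of unordered pairs of samples that landed on the same element."""
    counts = Counter(samples)
    return sum(comb(c, 2) for c in counts.values())


def estimate_l2_norm_squared(samples):
    """Collisions divided by C(r, 2), an unbiased estimate of ||p||_2^2."""
    r = len(samples)
    return count_self_collisions(samples) / comb(r, 2)


def collision_tester(samples, sigma, slack=0.5):
    """Accept (True) iff the collision count is at most slack * C(r, 2) * sigma.

    `slack` is a placeholder for the constant in the rejection rule, which is
    an assumption of this sketch rather than the value used in the paper.
    """
    r = len(samples)
    threshold = slack * comb(r, 2) * sigma
    return count_self_collisions(samples) <= threshold


demo_collisions = count_self_collisions([1, 1, 1, 2, 2, 3])
```

The estimator works because each pair of independent samples collides with probability exactly $\|p\|_2^2$, so the expected number of collisions among $r$ samples is $\binom{r}{2}\|p\|_2^2$.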
C On tightness of Lemma 5.3


We prove the following lemma to show that Lemma 5.3 is essentially tight for $k = 2$ and constant


Lemma C.l. Let G = {V,E) be a d-regular graph composed of two parts A and B, each of size 
n/2. Let 4>g{A) = 4>g{B) = font < Let f := V 2 be the second eigenvector with unit-length of 
the Laplacian matrix C of G. Then 

max I ^ ifn- ^)^ ^ ifn- fv? 

u,vG.A u,v£B 

Proof. For any subset $U$, we define the potential of $U$ to be

$$\mathrm{pot}(U) := \frac{1}{|U|} \sum_{u,v \in U} (f_u - f_v)^2 .$$


^ ’pout 

' - 24d3 • 


Let X := ^— t+i 2 d^(d+i) — show that at least one of pot(^4), pot(i?) is larger than x. 

Assume on the contrary that pot (A), pot (i?) < x. We will derive a contradiction to the fact 
that / is the second eigenvector of C. 

First, for any subset $U$, we define the center of $U$ to be $\Delta_U := \frac{1}{|U|} \sum_{u \in U} f_u$. Then we have

$$\mathrm{pot}(U) = 2 \sum_{v \in U} (f_v - \Delta_U)^2 .$$
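
The relation between the pairwise form of the potential and its centered form can be verified numerically. The sketch below assumes that the pairwise sum ranges over ordered pairs $(u, v)$ and is normalized by $1/|U|$, which is the reading under which the two expressions agree; the function names are illustrative.

```python
def potential(values):
    """(1/|U|) * sum over ordered pairs (u, v) in U of (f_u - f_v)^2."""
    m = len(values)
    return sum((a - b) ** 2 for a in values for b in values) / m


def potential_via_center(values):
    """2 * sum over v in U of (f_v - center)^2, with center the mean of f on U."""
    m = len(values)
    center = sum(values) / m
    return 2 * sum((a - center) ** 2 for a in values)


f_on_u = [1.0, 2.0, 4.0, 7.0]
pairwise_value = potential(f_on_u)
centered_value = potential_via_center(f_on_u)
```

Both expressions expand to $2\sum_{v} f_v^2 - \frac{2}{|U|}\left(\sum_v f_v\right)^2$, which is why they coincide.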

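Now our assumption implies that

$$\sum_{u \in A} (f_u - \Delta_A)^2 \le \frac{x}{2} , \qquad \sum_{u \in B} (f_u - \Delta_B)^2 \le \frac{x}{2} . \tag{3}$$

Furthermore,

$$
\begin{aligned}
\sum_{u \in A} (f_u - \Delta_A)^2 + \sum_{u \in B} (f_u - \Delta_B)^2
&= \sum_{u \in V} f_u^2 - 2\Delta_A \sum_{u \in A} f_u + |A| \Delta_A^2 - 2\Delta_B \sum_{u \in B} f_u + |B| \Delta_B^2 \\
&= 1 - \frac{n}{2} \left(\Delta_A^2 + \Delta_B^2\right) \\
&\le x , \qquad (4)
\end{aligned}
$$

where the penultimate equation follows from the fact that $\sum_{u \in A} f_u = |A| \Delta_A$, $|A| = |B| = n/2$ and
$\sum_u f_u^2 = 1$ since $f$ is a unit vector.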

On the other hand, since $f$ is the second eigenvector of $\mathcal{L}$, we have $\sum_u f_u = 0$. Furthermore,

$$\Delta_A + \Delta_B = \frac{2}{n} \left( \sum_{u \in A} f_u + \sum_{u \in B} f_u \right) = 0 . \tag{5}$$

Therefore, by inequality (4) and equation (5), we have that at least one of $\Delta_A, \Delta_B$ is positive.
Wlog., we assume that $\Delta_A > 0$. This further implies that $\Delta_A \ge \sqrt{(1-x)/n}$ and $\Delta_B \le -\sqrt{(1-x)/n}$.

Let $0 < y < 1$ be a parameter that will be specified later. Let $A_1 := \{u \in A : (f_u - \Delta_A)^2 \ge \frac{x}{2y|A|}\}$ and let
$B_1 := \{u \in B : (f_u - \Delta_B)^2 \ge \frac{x}{2y|B|}\}$. Then by our assumption of inequalities (3), we know that
$|A_1| \le y|A|$ and $|B_1| \le y|B|$. We further define $A_2$ to be the subset of $A \setminus A_1$ such that for any
$u \in A_2$, at least one of its neighbors is contained in $A_1$ or $B_1$. We define $B_2$ similarly. Since the
maximum degree of vertices in $G$ is at most $d$, we know that $|A_2| + |B_2| \le d(|A_1| + |B_1|)$.

We call a vertex $v$ bad if $v$ belongs to $(A_1 \cup A_2) \cup (B_1 \cup B_2)$. Otherwise, we call $v$ good.
Note that the number of bad vertices is equal to $|A_1 \cup A_2| + |B_1 \cup B_2| \le (d + 1)(|A_1| + |B_1|) \le
(d + 1)y(|A| + |B|) = (d + 1)yn$. Also, the number of edges involving any bad vertices is at most
$d(d + 1)yn$.

Now we let y = 3 ^^- Since the number of edges between A and B is e{A, B) = (j)outd\A\ > 
$d(d + 1)yn$, there exists at least one edge, say $(u, v) \in E$, such that $u \in A$ and $v \in B$ and both $u, v$
are good.

Since $u$ is good, we know that all of its neighbors are in $A \setminus A_1$ or $B \setminus B_1$. Let $d_A, d_B$ denote
the number of neighbors of $u$ belonging to $A \setminus A_1$, $B \setminus B_1$, respectively. By the fact that there
exists at least one crossing edge $(u, v)$, we know that $d_B \ge 1$. Note that for any vertex $w \in A \setminus A_1$,
$|f_w - \Delta_A| < \sqrt{\frac{x}{2y|A|}} = \sqrt{\frac{x}{yn}}$, and for any vertex $w \in B \setminus B_1$, $|f_w - \Delta_B| < \sqrt{\frac{x}{yn}}$. We have
that


E 

'w:(w,u)gE 





X 


— {d—l){AA + \ —) + Ab + \ — 

yn 


= (d-2)A^ + d 


yn 



yn 


< (d--)A^ 

< d{l-2(j)out){AA- 

< d(l - X2)fu , 



yn 


where the second inequality follows by our choices of x and y (since we set x = ^— f+i 2 d^(d+i) ? 


obtain 2dA^ = < A a), the third inequality follows by our assumption that (pout < and 


in the last inequality we use the fact that A 2 < 2p{G) < 2pout- 

Now since $f$ is the second eigenvector of $\mathcal{L}$, that is, $f\mathcal{L} = \lambda_2 f$, we know that for each vertex $u$,
$\sum_{w : (w,u) \in E} f_w = d(1 - \lambda_2) f_u$. This is a contradiction. □


Remark. Note that Lemma C.1 implies that if a graph consists of two large clusters $A, B$,
each of size $n/2$ and outer conductance $\phi_{out}$, then for at least one cluster, say $A$, the average value
of $(v_2(u) - v_2(v))^2$ over all vertex pairs in $A$ is large. More precisely,

dp- 

Furthermore, if the inner conductance of each cluster is at least pin such that pout = 
then for any t < 0(^p^) < we have




E iip^-p 


t ||2 
u\\2 


uGA 




E “ Vj (?;))2 ( 1 - ^ 


usA i=l 


2 t 


- ^ E -V2(r;))2 A- 

I I u,v&A ^ 


A 2 

2 


2 t 


> n 

= Q 


nd? 


where the penultimate inequality follows from the inequality $\lambda_2 \le 2\phi_{out}$ and the last inequality
follows from our choice of $t$.

Therefore, the average value of $\|p_u^t - p_v^t\|_2^2$ over all vertex pairs $u, v$ in the cluster $A$ is